X Research — Autonomous Work Operating Guide

How to organize overnight and unsupervised Claude Code sessions so they produce useful results without drifting, duplicating work, hiding failures, or depending on improvisational context.

Last updated: 2026-05-01


Core Model: Bounded Deterministic Runs

Every autonomous session is a bounded run with a defined scope, explicit stopping conditions, and file-based state that survives context compaction. The agent reads a plan, executes tasks from it, writes progress to durable files, and stops when the plan is done or a limit is hit.

This is not a suggestion. The alternative — an agent improvising its own task sequence in a long context window — produced 42 runs in one night, a production data overwrite, empty heartbeat state, and no compact checkpoint summary. It worked by luck, not design.

The Three Laws

  1. State lives in files, not context. Context compacts. Files don't. Every piece of state the next session needs must be in a file before this session ends.
  2. Scope is pre-defined and closed. The session prompt lists exactly which tasks to do and explicitly names what is out of scope. If a task isn't listed, it doesn't happen.
  3. Progress is machine-verifiable. "It looks done" is not done. Each task has acceptance criteria that a script or grep can check.
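
One way to make Law 3 concrete is to express each acceptance criterion as a command whose exit code decides the outcome. The sketch below is illustrative only: `acceptance_passes` is a hypothetical helper, not an existing project script, and the 60-second timeout is an assumed default.

```python
import subprocess


def acceptance_passes(command: list[str], timeout_s: float = 60.0) -> bool:
    """Run one acceptance command; exit code 0 passes, anything else fails.

    Hypothetical helper for illustration, not part of the repo.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A check that hangs or cannot start counts as a failure, never a pass.
        return False
    return result.returncode == 0
```

For example, `acceptance_passes(["grep", "-q", "phase1-diagnose", "data/run-state/2026-04-29.json"])` would check that a task ID was actually recorded in a checkpoint, rather than trusting a prose claim that it was.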

Seva's Autonomy Economics

Autonomous work on this project optimizes for Seva's time and system reliability, not for minimizing model tokens, API cost, or the agent's aesthetic sense of efficiency.

Intention:

  1. Optimize Seva's time, not tokens. If autonomous work can convert waiting, supervision, or morning debugging into tested artifacts, it is usually worth doing.
  2. Prefer extra reversible work over lost time. It is better to do one safe extra experiment, test, refactor, rehearsal, or fixture and later revert it than to stop early and discover the gap when Seva is waiting on the system.
  3. Reliability is empirical. A pipeline is not reliable because one or two happy-path checks passed. Especially with LLM processing, reliability is learned by repeated runs, adversarial cases, fixtures, dry-runs, and boring evidence. Nobody runs the first 10% of evals and declares the system done.
  4. Autonomous AI pipeline development is like an autonomous engineering team. Individual tasks can be done, but useful work remains in research, refactoring, testing, observability, cleanup, rehearsal, and optimization.
  5. Improvement goals can be unbounded by design. Goals like “make the system generate posts better and better” can be improved indefinitely. That is normal; bounded slices keep the work safe, not finite.
  6. If ideas run out, launch a reviewer/planner. When the worker cannot find the next useful bounded slice, start a separate fresh AI review to inspect the current state and propose the next plan instead of idling.
  7. Risk ownership follows Seva's explicit instruction. If Seva says “100 runs” or “work until HH:MM”, the agent should execute that directive within the declared safety boundaries. Do not silently substitute token/cost conservation for the user's chosen risk posture.

Operational consequence: green checks are a baseline for the next slice, not an automatic stop condition before the requested stop time or before a result-driven goal is actually DONE. The supervisor should continue finding useful bounded work inside the declared goals until the stop time/result, a real blocker, repeated non-progress, or a safety boundary.

Result-driven continuation rule (added 2026-05-11). When Seva says a run is result-driven, the supervisor must not convert PARTIAL + decision-bearing into a resting state. A decision item is a prompt to narrow risk, run deployed/operator-path audits, launch a fresh reviewer, create rollback-safe assumptions, or ask one precise blocking question — not a stop condition by itself. Stop only when every safe next slice is blocked by a named hard-stop class: safety boundary, missing credential/access, repeated non-progress, explicit user stop, or a concrete Seva decision that blocks all safe investigation/verification/repair.

Case rule from 2026-05-01: completing every named stage in the initial checklist does not satisfy a “work until HH:MM” instruction. After the Launch Mafia stages 1-10 were completed around 04:00, the supervisor incorrectly stopped even though Seva had requested work until 11:00. The correct behavior was to continue with deployment verification, browser E2E, daily-cycle integration, hardening, reviewer passes, and adversarial/recovery testing until the stop time. Do not repeat this failure.


Session Structure

Before Launch (human or supervisor prepares)

  1. Write one active sprint prompt using the template below. Default path for this campaign shape: prompts/autonomous-YYYY-MM-DD-<slug>.md (not multiple handoff files). Older prompts are Evidence after the run; do not load them by default.
  2. Anchor the session to Operator Workflows. The prompt names the GOALS-PIPELINE.md §4a workflow(s), the OJ-* rows from docs/specs/operator-journeys.md, and the exact first proof/gate being built or repaired. The current X Research campaign starts with runner wiring, then OJ-005 Rewrite / Rework Without Losing Context, then OJ-001/OJ-004 as the graph allows.
  3. Define the task list: specific task IDs from the launch prompt's sprint brief or a reliability plan. Maximum 10 tasks per session. (Active sprint context lives in the launch prompt / heartbeat / supervisor state, not a canonical doc.)
  4. Set stopping conditions: task count, CC run count (max 15), wall-clock time, or "Phase N complete."
  5. Verify prerequisites: OAuth tokens valid, API keys present, no conflicting parallel runs.
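
The numeric limits in the list above (at most 10 tasks, at most 15 CC runs) are easy to check mechanically before launch. A minimal sketch, assuming the plan is available as a list of task IDs plus a run limit; `validate_session_plan` is a hypothetical name, not an existing script.

```python
MAX_TASKS_PER_SESSION = 10  # limit stated in the pre-launch checklist
MAX_CC_RUNS = 15            # limit stated in the pre-launch checklist


def validate_session_plan(task_ids: list[str], cc_run_limit: int) -> list[str]:
    """Return a list of problems; an empty list means the plan is launchable."""
    problems = []
    if not task_ids:
        problems.append("task list is empty")
    if len(task_ids) > MAX_TASKS_PER_SESSION:
        problems.append(f"{len(task_ids)} tasks exceeds {MAX_TASKS_PER_SESSION}")
    if len(set(task_ids)) != len(task_ids):
        problems.append("duplicate task IDs")
    if not 1 <= cc_run_limit <= MAX_CC_RUNS:
        problems.append(f"cc_run_limit must be between 1 and {MAX_CC_RUNS}")
    return problems
```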

During Execution (agent follows)

  1. Read plan first. Before any work, read the session prompt, OPERATING-PRINCIPLES.md, and the latest checkpoint summary at data/run-state/<latest>.json.
  2. One task at a time. Complete, verify, record, then move to next. No interleaving.
  3. Write checkpoints to data/run-state/YYYY-MM-DD.json after each task completion:

     {
       "session_date": "2026-04-29",
       "tasks_completed": ["phase1-diagnose", "phase2-repair"],
       "tasks_remaining": ["phase3-health", "phase4-docs"],
       "cc_runs": 7,
       "last_completed_at": "2026-04-29T06:30:00Z",
       "failures": [],
       "decisions": ["skipped Gate 5 browser test — no approved items in queue"],
       "artifacts_created": ["data/run-reports/2026-04-29-phase1.json"]
     }

  4. Stop at limits. Stop when the run count limit is hit, when a blocking failure occurs, or when the task list is done and no stop time or result-driven goal remains open. Write a final checkpoint summary. Do not invent work outside the session goal.
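
A minimal sketch of the checkpoint step, assuming the checkpoint is a plain dict with the fields shown in step 3. The helper names `complete_task` and `write_checkpoint` are hypothetical. The write goes through a temporary file and a rename so that a crash mid-write cannot leave a truncated checkpoint for the next session to read.

```python
import json
import os
import tempfile
from pathlib import Path


def complete_task(state: dict, task_id: str, finished_at: str) -> dict:
    """Return a new state with task_id moved from remaining to completed."""
    if task_id not in state["tasks_remaining"]:
        raise ValueError(f"{task_id} is not in tasks_remaining")
    return {
        **state,
        "tasks_completed": [*state["tasks_completed"], task_id],
        "tasks_remaining": [t for t in state["tasks_remaining"] if t != task_id],
        "last_completed_at": finished_at,
    }


def write_checkpoint(state: dict, run_state_dir: str = "data/run-state") -> Path:
    """Atomically write the checkpoint named after state["session_date"]."""
    directory = Path(run_state_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{state['session_date']}.json"
    # Temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
        handle.write("\n")
    os.replace(tmp_name, target)
    return target
```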

When the Task List Ends Before the Stop Time

The task list is the current bounded workfront, not the ceiling of useful work. If all listed tasks are complete before the stop time:

  1. Return to the session goal and DoD. Do not stop just because the immediate checklist is green.
  2. Ask what is still useful inside the goal boundary. Look for the next safest bounded slice: rehearsal, adversarial testing, fixture coverage, live-boundary hardening, recovery drills, operator visibility, documentation that makes the cycle more reliable, or a fresh reviewer pass.
  3. Do not expand beyond the goal. Infinite goals are not a license for unrelated features. New work must directly serve the declared session goal and remain inside explicit out-of-scope limits.
  4. Create the next mini-task list. Write 3-7 concrete tasks with acceptance checks, update the checkpoint, then continue.
  5. Stop only at stop time, real blocker, safety boundary, or repeated non-progress. If Seva gave an explicit stop time, green checks and an empty checklist are not a stop condition. Keep selecting the next useful bounded slice inside the goal until the stop time arrives.

Forbidden phrasing before stop time: do not write or act on “mission complete”, “full autonomous cycle closed”, “all stages done”, or equivalent as a stopping reason while the stop time has not arrived. At most, write “initial checklist complete; continuing with hardening/reviewer/integration slices until HH:MM.”

In short: tasks are finite; goals guide continuation. The machine should not confuse an empty checklist with completed responsibility. That would be almost human.

Gap-Review Escalation Policy (added 2026-04-29)

When Claude Code reports "all done" and an independent verifier (Marvin or another fresh CC session) confirms green checks while time still remains before the stop time, do not idle. Escalate in this exact order:

  1. Launch a fresh gap-review session. A fresh CC session with no prior context reads the goal/DoD and looks for gaps the original worker missed: missing tests, missing docs, edge cases, regressions, ambiguous artifacts, drift between code and canonical state, smoke coverage holes. If the gap review finds work, do it inside the session goal and verify.
  2. Repeat up to three gap-review passes. Three independent passes is the cap before declaring the immediate goal genuinely saturated. Each pass must run with a fresh context window so it does not inherit the previous pass's blind spots.
  3. After three clean passes, move one level up. Inside the same session goal boundary, work on one of: reliability, quality, refactoring, optimization, cleanup, script clarity/currency, observability, pipeline debuggability, agent-confusion reduction (clearer artifacts, clearer canonical pointers, clearer error messages, fewer foot-guns). Pick the slice with the highest expected payoff for Seva's time.
  4. Hard stops still apply. Stop time, declared safety boundary, real blocker, or repeated non-progress on the same slice always overrides this escalation policy. The point is to never silently idle when there is non-trivial useful work still inside the goal boundary; it is not a license to invent unrelated features.

Operational consequence: a verified-green checkpoint mid-session is a baseline, not a terminator. The supervisor relaunches gap-review or higher-level work until stop time, then writes the final checkpoint summary.
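
The escalation order above can be sketched as a small decision function. This is an illustration under assumptions: the names are hypothetical, and the policy does not say whether a pass that finds work resets the count, so the sketch simply counts clean passes.

```python
from enum import Enum


class NextAction(Enum):
    STOP = "stop"
    GAP_REVIEW = "launch_fresh_gap_review"
    LEVEL_UP = "move_one_level_up"


MAX_GAP_REVIEW_PASSES = 3  # cap stated in the escalation policy


def next_escalation_step(clean_passes: int, hard_stop: bool) -> NextAction:
    """Pick the next supervisor action after a verified-green checkpoint."""
    if hard_stop:
        # Stop time, safety boundary, real blocker, or repeated non-progress.
        return NextAction.STOP
    if clean_passes < MAX_GAP_REVIEW_PASSES:
        # Each pass must run in a fresh context window.
        return NextAction.GAP_REVIEW
    return NextAction.LEVEL_UP
```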

Explicit stop-time override. When Seva says “work until HH:MM”, that instruction overrides the agent's temptation to enter a stable resting state. Until that time, the supervisor should relaunch work whenever no worker is active, using the goal/DoD to select the next bounded slice. Green verification is a baseline, not an exit condition.

Stop time limits new launches, not active workers. When Seva says “work until HH:MM”, the deadline means: keep launching new bounded iterations until that time. It does not mean kill, cancel, or artificially constrain workers that were already launched before the deadline. Existing Claude Code/run-task workers should finish naturally, write their artifacts/checkpoints, and deliver their wake/result unless a genuine safety issue appears. Cutting off an active worker wastes the work already paid for and can break result delivery.

Do not encode the stop time as a worker timeout. For autonomous work-until-time instructions, the stop time belongs to the supervisor/heartbeat relaunch policy: it controls when to stop launching new iterations. Do not pass a short run-task.py --timeout just to match the stop time. A worker launched before the deadline should be allowed to finish naturally unless there is a genuine safety/resource issue. Use worker timeouts only for real runaway protection, and make them comfortably longer than the expected useful work window.

Stop-Time Supervisor Contract (added 2026-05-01)

When a user gives a stop time, the supervisor must maintain an explicit relaunch contract outside the worker prompt. A worker can finish; the supervisor is responsible for noticing there is still time left.

Required supervisor state:

{
  "stop_time_local": "2026-05-01T11:00:00-07:00",
  "goal": "Launch Mafia end-to-end system",
  "launches_until_stop": true,
  "initial_checklist_complete": false,
  "post_checklist_mode": "hardening_review_integration",
  "last_worker_finished_at": "...",
  "next_slice": "...",
  "stop_reason": null
}

Rules:

  1. launches_until_stop: true means no idle resting state before stop_time_local. If no worker is active, launch the next bounded slice unless stopped by a real blocker/safety boundary/repeated non-progress.
  2. initial_checklist_complete changes the mode, not the stop condition. Switch to post_checklist_mode and continue with hardening, browser E2E, integration, reviewer, fixture, deploy-verification, observability, or recovery work inside the same goal.
  3. Write an explicit stop_reason. Valid reasons before stop time are only: blocked, safety_boundary, repeated_non_progress, or explicit_user_stop. checklist_complete is invalid before stop time.
  4. Use a wake safety net. For long autonomous windows, create at least one precise cron/system wake near the next expected decision point or a periodic wake until the stop time. Heartbeat alone is useful but not a durable supervisor contract.
  5. On every wake, compare now vs stop time before deciding. If now < stop time and no worker is active, the default action is to launch useful bounded work, not to summarize success.
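
Rules 1, 3, and 5 can be sketched as a single wake-time decision. The function name and the returned action strings are hypothetical. The sketch assumes `launches_until_stop` is true, that `now` is timezone-aware, and that active workers are left to finish after the stop time.

```python
from datetime import datetime

# The only stop reasons the contract accepts before stop time.
VALID_EARLY_STOP_REASONS = frozenset(
    {"blocked", "safety_boundary", "repeated_non_progress", "explicit_user_stop"}
)


def decide_on_wake(state: dict, now: datetime, worker_active: bool) -> str:
    """Compare now against stop_time_local before deciding anything else."""
    stop_time = datetime.fromisoformat(state["stop_time_local"])
    if now >= stop_time:
        # The stop time limits new launches; it does not kill active workers.
        return "let_worker_finish" if worker_active else "write_final_checkpoint"
    reason = state.get("stop_reason")
    if reason is not None:
        if reason not in VALID_EARLY_STOP_REASONS:
            raise ValueError(f"invalid stop_reason before stop time: {reason!r}")
        return "stop"
    if worker_active:
        return "wait_for_worker"
    return "launch_next_slice"
```

Raising on an invalid reason such as `checklist_complete` makes the premature stop a visible defect instead of a quiet resting state.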

If this feels repetitive, good. Repetition is how robots avoid making the same stupid mistake twice, a bar we occasionally clear.

Visible goal tree before every Claude Code launch (added 2026-05-11)

Before every Claude Code launch in an autonomous run, the supervisor must write a visible, compact goal tree. This is required even when the next slice feels obvious.

Purpose:

  1. Human monitoring. Seva can see which branch of the run is active without reconstructing state from logs.
  2. Debuggability. Later postmortems can tell whether the supervisor was optimizing the right goal or just choosing the easiest reversible task.
  3. Attention steering. The act of marking the active branch forces the supervisor and next worker to orient to the run goal, not merely the latest technical subtask.

Required shape (adapt labels to the active run, keep it short):

Goal tree before next Claude Code launch:
- G1 docs/architecture — PARTIAL/DONE/BLOCKED — current evidence
  - [ ] sub-branch not active
  - [x] active branch: <the slice being launched now>
- G2 tests — PARTIAL/DONE/BLOCKED — current evidence
- G3 fresh-CC bootstrap — PARTIAL/DONE/BLOCKED — current evidence
- G4 product/operator result — PARTIAL/DONE/BLOCKED — current evidence

Active branch now: G2 → L4 test-account E2E → refresh token + rerun wrapper
Why this branch: lowest-scored non-blocked goal / required acceptance gate
Expected evidence: <report path, command, or proof class>

Rules:

  1. Print the tree before launch, not after. It must be a visible decision turn between the previous worker result and the next run-task.py start. Silent relaunch is forbidden.
  2. Mark exactly one active branch. If multiple goals are tied, choose one and state why. Do not launch a worker with an ambiguous objective like “continue improvements.”
  3. Name the acceptance gate. The tree must say which gate the slice advances (G1 graph, G2 L4, G3 fresh probe, G4 deployed operator path, etc.) and what evidence will prove progress.
  4. Use the active scoreboard as source of truth. The statuses must match AGENTS.md §0, HEARTBEAT.md, and data/run-state/*.json. If they do not match, the active branch is reconciliation, not product work.
  5. Keep it compact. This is not a status essay. Four goal rows plus the active branch/reason/evidence is enough.
  6. Reflect blockers honestly without stopping early. If the active branch exposes a Seva decision but the product goal is still PARTIAL, mark the decision-bearing part and then choose the next safe audit/reviewer/proof/repair branch that narrows the decision. Only mark the whole run BLOCKED when every safe next branch is blocked by the same concrete decision or safety/access boundary.

If a wake launches Claude Code without this visible tree, treat that as a supervisor-process defect and record it in the run report. If a wake stops while any product goal is PARTIAL without naming a hard-stop class, treat that as a supervisor-process defect too. The whole point is to make the run inspectable while it is still happening, before everyone has to become an archaeologist.
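
A minimal sketch of rendering the tree and enforcing rule 2 (exactly one active branch). The `Goal` dataclass and `render_goal_tree` are hypothetical names, and the sketch omits inactive sub-branches for brevity. The `NOT_STARTED` status is taken from the goal-closure slice rule, not from the template itself.

```python
from dataclasses import dataclass
from typing import Optional

SEP = " \u2014 "  # same separator the required shape uses between fields
STATUSES = {"NOT_STARTED", "PARTIAL", "DONE", "BLOCKED"}


@dataclass
class Goal:
    label: str                          # e.g. "G2 tests"
    status: str                         # one of STATUSES
    evidence: str                       # current evidence, kept short
    active_slice: Optional[str] = None  # set on the one goal being worked


def render_goal_tree(goals: list[Goal], path: str, why: str, expected: str) -> str:
    """Render the pre-launch tree; refuse ambiguous or missing active branches."""
    active = [g for g in goals if g.active_slice is not None]
    if len(active) != 1:
        raise ValueError(f"exactly one active branch required, found {len(active)}")
    lines = ["Goal tree before next Claude Code launch:"]
    for goal in goals:
        if goal.status not in STATUSES:
            raise ValueError(f"unknown status {goal.status!r} for {goal.label}")
        lines.append(f"- {goal.label}{SEP}{goal.status}{SEP}{goal.evidence}")
        if goal.active_slice is not None:
            lines.append(f"  - [x] active branch: {goal.active_slice}")
    lines += [
        "",
        f"Active branch now: {path}",
        f"Why this branch: {why}",
        f"Expected evidence: {expected}",
    ]
    return "\n".join(lines)
```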

After Completion (human or supervisor reviews)

  1. Read the checkpoint file and final summary.
  2. Verify claims. Agent said it fixed X? Check the file. Agent said tests pass? Run them.
  3. Update the run checkpoint summary with verified status (agent's marks are [?] UNVERIFIED until confirmed).

Supervisor verification rule (no PASS without operator-path gate)

When CC reports done on operator-visible work, the supervisor MUST independently verify the operator-path gate from GOALS-PIPELINE.md §3 item 0 before accepting PASS. The verification path is:

  1. Identify the operator-path gate for this slice. UI work → a click-E2E that performs the actual operator gesture (Approve/Publish/Reject/Rewrite) on a real-shaped row against the production deployment, asserting DOM + counts + Supabase row + audit trail. Worker/script work → a curl/script call against the production API followed by a Supabase/canonical read confirming the mutation. Pure-library work → reach the gate via at least one downstream surface.
  2. Run the gate yourself, or launch a fresh verifier CC session whose only job is to run it. Do not accept the original worker's claim that "smoke passes" as substitute. Render-and-filter, DOM-presence-only, fixture replays, and helper-function smokes are pre-flight, not acceptance.
  3. If no operator-path gate exists for this slice, create one or mark the slice BLOCKED, not PASS. The absence of a runnable acceptance check is itself a defect; do not paper over it with "looks correct."
  4. Use the Phase 1A dreaming report (data/run-reports/2026-05-08/dreaming-history-root-docs-proposal.md) as an evidence pointer, not as required reading for every supervisor wake; its durable lessons are now distilled into GOALS-PIPELINE.md, OPERATING-PRINCIPLES.md, ARCHITECTURE.md, and this file. Open the report only when an evidence trail to a specific 2026-05-08 incident is needed.

Forbidden phrasing before the gate: "looks right", "looks correct", "appears to work", "behaves as expected", "verified by smoke", or "Claude says done" used as substitutes for the operator-path gate. PASS may be written only after either (a) a click-E2E or scripted operator-path call has been executed and asserted, or (b) the report explicitly states "operator-path not yet executed" and the supervisor has accepted the deferral with a follow-up task.
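
The phrasing rule lends itself to a simple report lint. The sketch below is an assumption about how such a lint could look, not an existing script: it flags the forbidden phrases by case-insensitive substring match and encodes the two conditions under which PASS may be written.

```python
FORBIDDEN_GATE_PHRASES = (
    "looks right",
    "looks correct",
    "appears to work",
    "behaves as expected",
    "verified by smoke",
    "claude says done",
)
DEFERRAL_MARKER = "operator-path not yet executed"


def find_forbidden_phrases(report_text: str) -> list[str]:
    """Return the forbidden phrases that appear in the report."""
    lowered = report_text.lower()
    return [phrase for phrase in FORBIDDEN_GATE_PHRASES if phrase in lowered]


def may_write_pass(
    report_text: str, gate_executed: bool, deferral_accepted: bool = False
) -> bool:
    """PASS needs an executed and asserted gate, or an accepted explicit deferral."""
    if gate_executed:
        return True
    return deferral_accepted and DEFERRAL_MARKER in report_text.lower()
```

A hit from `find_forbidden_phrases` is a prompt for the supervisor to check whether the phrase is standing in for the gate; it is not proof of a violation on its own.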

Production-deploy acceptance check (added 2026-05-11)

When the run names a product-deliverable goal — for example Seva says "ideal morning result", "open the interface and it works", "deploy", "live", or "I should see it in the UI" — source-only checks cannot close the goal.

  1. Run-finishing PASS requires deployed evidence. At least one deploy must contain the relevant slice. If no deploy happened, the slice is not in the surface Seva will open; the product goal is PARTIAL regardless of green smokes.
  2. Re-run the operator-path gate against the deployment that contains the slice. Local smokes/builds do not transfer to a new deployment. The proof must run against the URL named in the report.
  3. Classify the proof level. Use docs/reference/testing/TEST-DEPLOY-GATE.md L0-L5. A report that says PASS without naming the level is incomplete.
  4. Rollback is part of supervision. If the new deploy breaks a previously passing production-OJ scenario, rollback per docs/reference/deploy/VERCEL-DEPLOY-CHECKLIST.md or mark BLOCKED and escalate.
  5. Explicit no-deploy scope must be honest. If the launch prompt forbids deploy, the supervisor must report which goals required deploy and remain undelivered. No-deploy is not goal-closed.

Fresh-CC documentation acceptance check (added 2026-05-11)

When the run names documentation reliability — for example "fresh Claude Code should be able to", "test the documentation", or "give it only AGENTS.md" — documentation is not PASS until fresh workers prove it.

  1. Schedule fresh-CC bootstrap probes during the run. Default cadence: one after G1 looks complete on paper, one mid-run, one before stop time.
  2. Use a new clean Claude Code session. No resume, no inherited transcript. Give it only the repo path and a task; do not spoon-feed the file list or commands beyond what AGENTS.md provides.
  3. Record trip points in data/run-reports/<date>/fresh-cc-bootstrap-loop.md: wrong files read, missing context questions, parallel implementation patches, wrong tests, accidental top-level status docs, or confusion about repo structure.
  4. Repair docs in the same run, then repeat. A single failed probe is useful evidence, not a reason to stop.
  5. G3 PASS requires a trailing clean probe. Until one fresh session bootstraps with zero trip points, documentation reliability is PARTIAL, even if canonical docs look good to the current supervisor.
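
Rule 5 reduces to a check on the most recent probe only. A minimal sketch, assuming each probe is recorded as a list of trip-point descriptions in chronological order; `g3_status` is a hypothetical name.

```python
def g3_status(probes: list[list[str]]) -> str:
    """PASS only when the latest fresh-CC probe recorded zero trip points."""
    if probes and not probes[-1]:
        return "PASS"
    # No probes yet, or the trailing probe still tripped: stay PARTIAL.
    return "PARTIAL"
```

Earlier clean probes do not count once a later probe trips, which is why the check looks at the trailing entry instead of asking whether any probe was ever clean.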

Canonical docs are delegated product authority (added 2026-05-11)

For autonomous work, GOALS-PIPELINE.md, ARCHITECTURE.md, and docs/specs/operator-journeys.md are treated as Seva-approved product direction. If those canonical docs define the goal, target architecture, taxonomy, operator journey, or acceptance evidence, the supervisor must not ask Seva to re-decide it. Execute it.

Decision policy:

Automatically agreed / delegated to Marvin when present in canonical docs:

  • Target taxonomy, state vocabulary, filters, source names, and status views.
  • Operator journeys (OJ-*), target/current surfaces, capability/module mappings, and proof levels.
  • Migration direction and staged rollout path, including compatibility bridges and fallback/debug surfaces.
  • Canonical modules/primitives and anti-duplication requirements.
  • Test/acceptance gates, smoke/E2E/script names, proof classes, and deployed-operator-path requirements.
  • Safe reversible implementation defaults needed to satisfy the above: aliases before renames, read-time compatibility before data migration, additive UI before destructive removal, fallback routes before retirement, rollback plans before deploy.
  • Non-live-X product UI/API/script changes inside the documented scope, including small product decisions needed to make the operator path work.
  • Documentation/run-state/spec updates required to keep the traceability graph honest.

Not automatically agreed: live public X actions on the main account. Everything else inside the documented scope is delegated to Marvin: if it breaks, recover via git, rollback, regeneration, data restore, or a follow-up deploy. Use care and rollback plans, but do not escalate merely because a change is destructive-looking, schema-related, security-adjacent, or product-significant when it is still inside the canonical goals/architecture/scenarios.

  1. Default to autonomous resolution. Small and medium product/implementation choices inside the documented goals are Marvin's responsibility. Choose the safest reversible option that best satisfies the canonical goal/architecture/scenario, document the rationale, test it, and continue.
  2. Use the docs as the tie-breaker. When options differ, prefer the one that most directly preserves the documented North Star, DoD, target architecture, taxonomy invariants, operator journey, and rollback safety.
  3. Ask Seva only for main-account live public X actions or out-of-scope direction changes. Escalate when the next action would perform live public X actions on Seva's main account, or when the work would intentionally change the documented North Star/scope instead of implementing it. Otherwise choose a reversible/recoverable path and continue.
  4. Do not label canonical implementation work as a blocker. If GOALS-PIPELINE.md, ARCHITECTURE.md, or operator-journeys.md already says the target, remaining uncertainty is implementation work, not a Seva decision.
  5. When uncertain, launch analysis instead of asking. If the docs are ambiguous, first launch a fresh reviewer/deep-analysis Claude Code session to compare options against the canonical docs and recommend the safest path. Ask Seva only if that analysis shows a true unresolved scope/risk decision.
  6. One-question rule. If escalation is truly necessary, ask exactly one concise blocking question and include the default recommendation that will be taken if Seva delegates the decision back.

Product must not be left disassembled (added 2026-05-11)

For product/operator runs, the worst outcome is not extra token spend; it is Seva returning to a product that is half-migrated, old paths broken, new paths unproven, and the agent resting because it encountered a decision. That is a supervisor failure.

Rules:

  1. Optimize Seva's time, not tokens. Extra safe verification, reviewer passes, deployed audits, rollback preparation, and small repairs are preferred over stopping early and making Seva reconstruct product state.
  2. Never rest on a knowingly disassembled product. If old behavior may be broken and the replacement is not deployed/proven, the next slice is audit/repair/proof, not summary/stop.
  3. Scope belongs to Seva; implementation belongs to Marvin. Seva defines the goal and risk posture. Marvin owns intermediate implementation/product decisions inside that scope, including choosing safe reversible defaults, testing them, deploying when the goal requires it, and rolling back/repairing when evidence says so.
  4. Decision complexity triggers analysis, not paralysis. If a product choice feels too important for one supervisor judgment, launch an independent Claude Code reviewer/deep-analysis session to compare options and recommend a path, then continue with the safest reversible implementation.
  5. Ask only for truly blocking decisions. A question is blocking only when the remaining action is a live public X action on the main account or an explicit change to Seva's declared scope/North Star. Destructive-looking or schema/product-significant work inside scope should proceed with rollback/recovery evidence, not escalation.
  6. Partial product state requires active supervision. If the product goal is PARTIAL, heartbeat should keep relaunching bounded slices with visible goal trees until DONE or a named hard-stop class.

Goal-closure slice rule (added 2026-05-11)

Every bounded slice must name the acceptance gate it advances before it starts. "Small and reversible" is necessary but not sufficient. If a slice advances no acceptance gate, it must explicitly say why it is still needed and name the next slice that will advance a gate. A phase that only improves observability/static guardrails cannot be the final phase while any active-run goal remains NOT_STARTED or PARTIAL.

Premature-stop guard (added 2026-05-11). If the supervisor believes it should stop while a product goal is PARTIAL, it must first launch or explicitly rule out, with evidence, these continuation classes: deployed regression audit, old-vs-new surface comparison, operator-path E2E for the changed surface, fresh reviewer/gap review, smallest rollback-safe repair, and documentation/state reconciliation. If any class is still safe and relevant, stopping is invalid; heartbeat should relaunch with a visible goal tree.
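
The premature-stop guard can be sketched as a checklist that must be fully accounted for before a stop is accepted. The class identifiers and function names below are hypothetical. The sketch assumes the supervisor records, per continuation class, a short evidence string for either the launch or the rule-out.

```python
CONTINUATION_CLASSES = (
    "deployed_regression_audit",
    "old_vs_new_surface_comparison",
    "operator_path_e2e_for_changed_surface",
    "fresh_reviewer_gap_review",
    "smallest_rollback_safe_repair",
    "documentation_state_reconciliation",
)


def unresolved_continuation_classes(handled: dict[str, str]) -> list[str]:
    """List classes with no evidence of being launched or ruled out."""
    return [c for c in CONTINUATION_CLASSES if not handled.get(c, "").strip()]


def stop_is_valid(handled: dict[str, str]) -> bool:
    """A stop with a PARTIAL product goal is valid only if nothing is unresolved."""
    return not unresolved_continuation_classes(handled)
```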

Concept-duplication scan (added 2026-05-08)

Before accepting PASS, the supervisor MUST also confirm the slice did not introduce a parallel implementation of an existing concept. The check is:

  1. Identify whether the slice touched a concept governed by ARCHITECTURE.md §11 (tweet rendering, status badge, time formatting, cockpit SSR query, kind/cap/sentinel/idempotency-key/heartbeat/Supabase-client/error-normalization registries, X-API publish, smoke harness). If yes, continue; otherwise skip.
  2. Verify the change landed in the canonical module named in §11, not in a hand-rolled local copy on the touched surface. A new local fmtTs, DAILY_CAPS, TERMINAL_STATUSES, inline select().eq('status', 'pending,running'), or hand-rolled X-API call is a duplication. Approving such a slice is approving a future incident.
  3. If a parallel implementation was introduced, require either consolidation in the same slice or an explicit deprecation/migration plan named in the report. Without one, mark the slice BLOCKED, not PASS. "Temporary parallel implementation" without a closure date is how triplication happens.
  4. Use the Phase 3 dreaming report (data/run-reports/2026-05-08/dreaming-phase3-deeper-root-causes.md) as the evidence pointer for the 13 known duplication classes (D1–D13); do not reread it every wake. Its durable rules are distilled into OPERATING-PRINCIPLES.md §0b and ARCHITECTURE.md §11.

The two checks (operator-path gate and concept-duplication scan) compose: a slice that fixes a bug only on one of three duplicated surfaces will pass the gate today and reproduce the bug on a sibling surface tomorrow. PASS requires both.

Live E2E test-account residue check (added 2026-05-09)

When the slice ran or could have run live writes against @sevaustinovtest (see OPERATING-PRINCIPLES.md §0c and ARCHITECTURE.md §12), the supervisor adds a third check before accepting PASS:

  1. Confirm the test account is clean at rest. The ledger's e2e:<test_run_id>:* write rows MUST each have a paired undo row. Any unpaired write row indicates residue; mark BLOCKED, not PASS, and trigger the residue-recovery sweep (or manual cleanup if the sweep is not yet implemented).
  2. Confirm no production-account writes occurred. Inspect ledger_events for the test run window: every mode='real' row MUST carry target_account='@sevaustinovtest'. A real write against any other account is a P0 contract violation; halt, do not accept PASS, escalate to Seva.
  3. Confirm X_E2E_LIVE_ARMED and the production sentinels were not co-armed. MAFIA_LIVE_ARMED / ORIGINALS_LIVE_ARMED and X_E2E_LIVE_ARMED MUST NOT have been set in the same process. If process telemetry shows co-arming, the run is invalid regardless of what it claimed to have done.

These three checks compose with the operator-path gate and the concept-duplication scan. Until the §12 wiring exists, this section is the early warning: any claim of a "live E2E" run made before the wiring exists is by definition a contract violation, and the slice cannot be PASSed.
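
The three residue checks can be sketched against an in-memory list of ledger rows. The row schema here is an assumption made for illustration: the fields `key`, `kind` ("write" or "undo"), and `undoes` are hypothetical, and the real ledger_events shape may differ. Only `mode`, `target_account`, the `e2e:<test_run_id>:` key prefix, and the sentinel names come from the text above. A sentinel is treated as armed when its environment value is non-empty, which is also an assumption.

```python
TEST_ACCOUNT = "@sevaustinovtest"
PRODUCTION_SENTINELS = ("MAFIA_LIVE_ARMED", "ORIGINALS_LIVE_ARMED")


def unpaired_writes(rows: list[dict], test_run_id: str) -> list[str]:
    """Keys of e2e write rows in this run that have no paired undo row."""
    prefix = f"e2e:{test_run_id}:"
    in_run = [r for r in rows if r["key"].startswith(prefix)]
    writes = {r["key"] for r in in_run if r["kind"] == "write"}
    undone = {r["undoes"] for r in in_run if r["kind"] == "undo"}
    return sorted(writes - undone)


def wrong_account_writes(rows: list[dict]) -> list[dict]:
    """Real-mode rows that targeted anything other than the test account."""
    return [
        r for r in rows
        if r.get("mode") == "real" and r.get("target_account") != TEST_ACCOUNT
    ]


def co_armed(env: dict) -> bool:
    """True when the E2E sentinel and a production sentinel are both set."""
    e2e = bool(env.get("X_E2E_LIVE_ARMED"))
    production = any(env.get(name) for name in PRODUCTION_SENTINELS)
    return e2e and production
```

Any non-empty result from the first two functions, or `True` from the third, maps to the outcomes named above: BLOCKED for residue, a P0 escalation for a wrong-account write, and an invalid run for co-arming.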

Spec-coverage check (added 2026-05-09)

When the slice introduces or modifies an operator-visible feature or a shared primitive, the supervisor MUST also confirm docs/specs/<feature>.md exists, was updated in the same slice, and lists smokes/E2Es that verify the changed contract. If no spec exists, create one (or mark the slice BLOCKED, not PASS) per GOALS-PIPELINE.md §3 item 12 and ARCHITECTURE.md §17. PASS without a spec for a feature is identical to PASS without operator-path evidence: the contract is undocumented and the next regression has no test.

Load-budget check (added 2026-05-22)

When the slice adds or changes a hot-path Supabase query (any SSR query on force-dynamic operator surfaces, any polling API endpoint with a cadence of 60 s or less, a launchd worker poll, or a cron read), the supervisor MUST also confirm that node scripts/smoke-supabase-egress-budget.mjs exits 0 AND that the slice's report cites that exit code or a fresh scripts/supabase-usage-snapshot.mjs proof. A functional and cockpit-shape PASS over UNKNOWN load is the false-done class that produced the 2026-05-21 exceed_egress_quota restriction. PASS/PARTIAL/FAIL semantics and typed blockers are defined in docs/specs/supabase-load-budget.md. This check follows GOALS-PIPELINE.md §3 item 16 and OPERATING-PRINCIPLES.md §0a item 5.


Session Prompt Template

# Autonomous X Research Session — [DATE]

## Goal
[One sentence: what this session accomplishes]

## Tasks (in order)
1. [Task ID] — [one-line description] — acceptance: [how to verify]
2. [Task ID] — [one-line description] — acceptance: [how to verify]
...

## Stopping Conditions
- All tasks complete, OR
- [N] CC runs reached, OR
- Blocking failure on a mandatory task (document and stop)

## Out of Scope
- [Explicitly name forbidden work areas]
- No new features beyond task list
- No live X actions unless task specifically requires it

## Standing Rules
1. Script-first: no improvised agent steps. See OPERATING-PRINCIPLES.md.
2. Commit locally after each meaningful unit. Push if remote exists.
3. Write checkpoint after each task to data/run-state/YYYY-MM-DD.json.
4. No edits to SOUL.md, MEMORY.md, AGENTS.md, TOOLS.md.
5. If a task fails after one retry, document the failure and move on.

## Context Recovery (read after any compaction)
1. This file (the session prompt)
2. data/run-state/YYYY-MM-DD.json (checkpoint)
3. Checkpoint `summary` field (status summary; canonical sprint state)
4. Latest /tmp/cc-*.txt (most recent CC result)
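
The template's stopping conditions can be evaluated from the checkpoint alone. A minimal sketch, assuming the checkpoint carries the `tasks_remaining` and `cc_runs` fields named in this guide; `stopReason`, the `blocking_failure` flag, and the `maxCcRuns` parameter are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: maps a parsed checkpoint to one of the template's stopping conditions.
// `blocking_failure` is an assumed flag; `maxCcRuns` is the [N] from the session prompt.
function stopReason(checkpoint, maxCcRuns) {
  if (checkpoint.blocking_failure) return "blocking_failure";
  if (checkpoint.tasks_remaining.length === 0) return "all_tasks_complete";
  if (checkpoint.cc_runs >= maxCcRuns) return "run_limit_reached";
  return null; // no stopping condition met; continue with the next task
}
```

This covers ordinary bounded sessions only. In stop-time windows, the rule in "Premature Stop After Checklist Completion" applies instead: a finished task list is not a stop reason before the deadline.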

Failure Modes and Mitigations

Evidence-based, from our own runs and external sources.

Sprint Material Pack: Keep Canonical Guides, Summarize the Rest (added 2026-05-01)

Autonomous workers need enough context to act like owners, but a good sprint packet is not “read every artifact ever written.” Large context windows should be spent on current truth, tests, and source code — not on stale transcripts and duplicated phase reports.

Always include/read:

  1. Canonical guides and contracts — these are worth the tokens and should not be trimmed ad hoc: AGENTS.md (canonical entrypoint + launch/completion procedure), AUTONOMOUS-WORK.md, GOALS-PIPELINE.md, OPERATING-PRINCIPLES.md, ARCHITECTURE.md, docs/specs/action-catalog.md, and task-specific runbooks/readiness contracts in docs/reference/<system>/. Edit these only when Seva explicitly asks or when the task is to update the operating model.
  2. Current sprint goal / DoD — one concise statement of what “done” means now, including stop time, safety boundaries, and live-action posture.
  3. Latest state checkpoint — the newest data/run-state/* or equivalent, not every historical checkpoint.
  4. Latest accepted checkpoint summary — a compact current-state document that says what was done, what was verified, what remains, links to evidence, and exact next slice.
  5. The direct parent worker output — usually the latest /tmp/cc-*.txt referenced by the wake. Read older outputs only if the latest checkpoint summary points to a specific unresolved claim.
  6. Evidence artifacts by reference — run reports, screenshots, probe summaries, and old CC outputs should be listed with paths and one-line relevance. Read the full artifact only when it is needed for the current decision or verification.

Do not put these into every worker prompt by default:

  • Multiple old /tmp/cc-*.txt outputs whose facts have already been distilled into a current checkpoint summary.
  • Old prompts; prompts explain what was asked, not what became true. Keep path references for audit, but do not reread unless debugging prompt drift.
  • Historical run reports from previous phases when a newer accepted summary supersedes them.
  • Giant raw logs, HTML dumps, screenshots, or JSON artifacts unless the task specifically analyzes them.
  • Entire ROADMAP/PROGRESS history if the current sprint packet has a precise goal and the task only needs a narrow subsystem. Prefer targeted excerpts or explicit sections.

A good sprint packet should have three layers:

  1. Read fully: canonical guides + current sprint brief + latest checkpoint summary.
  2. Read selectively: task-specific source files, current tests, latest worker output, current evidence artifact.
  3. Reference only unless needed: older phase outputs, old prompts, raw logs, screenshots, legacy reports.

When a worker finishes, it should create or update one compact checkpoint summary that supersedes its raw output. The next worker should usually read that compact checkpoint summary and the latest direct output, not the entire phase archaeology. If the compact checkpoint summary is missing, create it before launching more work. This is how we avoid spending 100k tokens proving that text can be heavy.

Compaction Recovery: Read Handoff Documents, Not Just Bootstrap (added 2026-05-01)

After context compaction, restart, or a wake that resumes an autonomous window, the supervisor must not rely on the compacted chat summary as the source of truth. The compacted summary is only a pointer. Before making the next decision, read the durable checkpoint/run-report artifacts for the active workstream.

Required recovery sequence:

  1. Read AGENTS.md for the reading list plus launch/completion procedure, then the canonical docs it points to.
  2. Read AUTONOMOUS-WORK.md for current autonomous rules.
  3. Read the active session control file, usually HEARTBEAT.md, if the work is supervised by heartbeat/cron.
  4. Read the latest checkpoint/state file for the workstream (data/run-state/*.json, project-specific checkpoint, or equivalent).
  5. Read the latest relevant Claude Code outputs referenced by the wake, checkpoint, or HEARTBEAT.md (/tmp/cc-*.txt).
  6. Read the latest checkpoint/run-report artifacts created by the workers (data/run-reports/*, docs/*RUNBOOK*, readiness docs, gap analyses, production probe summaries, etc.).
  7. Only then decide whether to accept, relaunch, escalate, or stop.

Rule of thumb: if a worker or supervisor wrote a file specifically so future-you can continue, future-you must read it after compaction. Otherwise the file is just a tiny monument to wasted effort, and we have enough monuments already.

For stop-time autonomous windows, this rule is mandatory before deciding that a worker's result is “done.” The supervisor must compare the checkpoint/run-report artifacts against the original user goal and the stop-time contract, not against the compressed chat memory.

0. Premature Stop After Checklist Completion

What happened: On 2026-05-01, Launch Mafia autonomous mode was explicitly set to run until 11:00 PT. The initial stages 1-10 completed around 04:00 PT. Marvin treated that as “mission complete”, wrote a final checkpoint, and stopped launching work. Heartbeat should have continued with hardening/reviewer/integration slices, but the guide's “if mission is incomplete” wording let “named checklist complete” be read as “done”.

Root causes:

  • Stop-time instruction was not represented as durable supervisor state.
  • Heartbeat wording used “if mission is incomplete” instead of “if before stop time and no worker active”.
  • No cron safety-net wake existed (cron list was empty), so the relaunch contract depended on ambiguous heartbeat behavior.
  • The supervisor confused a green implementation checkpoint with an end-of-window checkpoint summary.

Mitigation:

  • Stop-time override is absolute for new launches until the deadline.
  • checklist_complete is not a valid stop reason before the deadline.
  • After checklist completion, switch to post-checklist mode: gap review, hardening, browser E2E, integration, deploy verification, recovery drills, observability, cleanup, and docs.
  • Long autonomous windows require a durable wake safety net in addition to heartbeat.
  • Final summaries before the stop time must include the next slice launch, not end the work.
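
The corrected heartbeat condition ("before stop time and no worker active") can be written as a small decision function. This is a sketch under assumptions: `nextSupervisorAction` and its input shape are hypothetical, and the action labels are illustrative rather than values the heartbeat actually uses.

```javascript
// Hypothetical supervisor decision for stop-time windows.
// Assumed inputs: `now` and `stopTime` are Date objects; the two flags are booleans.
function nextSupervisorAction({ now, stopTime, workerActive, checklistComplete }) {
  if (now.getTime() >= stopTime.getTime()) return "stop"; // the deadline is the only stop reason
  if (workerActive) return "wait";
  // Before the deadline, a finished checklist changes the mode, not the decision to launch.
  return checklistComplete ? "launch_post_checklist_slice" : "launch_next_task";
}
```

Applied to the 2026-05-01 incident (checklist complete at about 04:00 PT, stop time 11:00 PT, no worker active), this returns a launch action rather than a stop.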

1. Context Loss After Compaction

What happened: Marvin's HEARTBEAT.md was empty during a 42-run session. After compaction, there was no mechanism to recover the current goal, standing rules, or completion state. (Postmortem 2026-04-28, finding #1)

External parallel: Anthropic's harness guide (2025-11): "Each new session begins with zero memory of what came before." Ralph loop (Knightli, 2026-04): forces context rotation before degradation, persists state in files and git.

Mitigation: Checkpoint files + context recovery protocol in every session prompt. State lives in data/run-state/, not LLM context.

2. Scope Creep / Unbounded Task Expansion

What happened: Roadmap recommended 7 CC sessions. Overnight cycle completed all 7 and continued into M-tier and L-tier items without evaluation of diminishing returns. (Postmortem finding #6)

External parallel: Stanford SWE-chat study (2026-04): "More autonomy does not translate into more efficient delivery." MindStudio patterns: "Mistakes compound undetected across multiple steps in headless mode."

Mitigation: Session prompts have explicit task lists and run-count limits (max 15). Scope is closed — unlisted tasks don't happen.

3. Production Data Overwrite

What happened: M-2 task overwrote data/morning-packet-2026-04-28.md with a sparse placeholder via a --out flag parsing bug. Caught by QA, but shows corruption risk. (Postmortem finding #7)

External parallel: Vectara awesome-agent-failures: Replit AI deleted production database during code freeze then generated fake data to cover it. Amazon Kiro deleted production AWS environment (13-hour outage).

Mitigation: Write steps are fail-closed. Scripts that write production data must support --dry-run. Agent never overwrites existing production files without explicit task instruction.
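
One way to read "fail-closed" is that the write path refuses by default and only proceeds when every condition is explicitly satisfied. A minimal sketch, assuming a hypothetical `planProductionWrite` helper that only decides and leaves the I/O to the caller; the input names are illustrative.

```javascript
// Hypothetical planner for a production write. It decides; the caller performs the I/O.
// Assumed inputs: `targetExists`, `dryRun` (defaults to true), `overwriteAllowedByTask`.
function planProductionWrite({ targetExists, dryRun = true, overwriteAllowedByTask = false }) {
  if (dryRun) return { action: "report_only", reason: "dry-run" };
  if (targetExists && !overwriteAllowedByTask) {
    return { action: "refuse", reason: "target exists and task did not authorize overwrite" };
  }
  return { action: "write", reason: targetExists ? "overwrite authorized by task" : "new file" };
}
```

In Node, opening the target with the `wx` file system flag gives the same refusal at the filesystem level, because that flag fails when the path already exists.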

4. Retry/Cost Runaway

What happened (external): Agent burned $4,200 in 63 hours hitting rate-limit errors ~4,800 times/hour with no budget gate. (Sattyam Jain postmortem, 2026-04)

Mitigation: Budget guards on four dimensions: token ceiling per task, wall-clock time per session, CC run count limit, and max 1 retry per failed task.
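
The four budget dimensions can be checked in one place before each launch or retry. A sketch under assumptions: `exceededBudgets` and the field names are hypothetical, and every limit value is supplied by the caller rather than taken from this guide.

```javascript
// Hypothetical budget gate. The four dimensions come from the guide;
// the field names and all limit values are the caller's assumptions.
function exceededBudgets(usage, limits) {
  const checks = [
    ["task_tokens", usage.taskTokens, limits.maxTaskTokens],
    ["session_seconds", usage.sessionSeconds, limits.maxSessionSeconds],
    ["cc_runs", usage.ccRuns, limits.maxCcRuns],
    ["task_retries", usage.taskRetries, limits.maxTaskRetries],
  ];
  // Keep only the dimensions whose usage is over the limit.
  return checks.filter(([, used, max]) => used > max).map(([name]) => name);
}
```

An empty result means the next step may proceed; any named dimension is a reason to stop and document instead of retrying.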

5. Approval Fatigue

What happened (external): Anthropic sandboxing paper (2025-10): users stop scrutinizing permission prompts. HITL patterns study: reviewer quality drops 40% after 2-hour shifts.

Mitigation: Automated containment over human vigilance. --dry-run default, fail-closed writes, scoped permissions. Human reviews artifacts after the session, not individual prompts during it.

6. Silent Completion / Invisible Results

What happened: CC tasks completed but results weren't surfaced to supervisor. Telegram wake delivery was non-deterministic. (Memory 2026-04-28: "processed but invisible" failure)

Mitigation: Deterministic notification delivery. Every task writes to a checkpoint file. Final summary written to a known path. Supervisor reads checkpoint, doesn't depend on notification.


Artifact Contracts

Every autonomous session must produce these artifacts and no extra permanent progress files:

| Artifact | Path | When |
| --- | --- | --- |
| Session prompt | prompts/autonomous-YYYY-MM-DD-<slug>.md | Before launch; one active prompt per autonomous campaign |
| Run checkpoint | data/run-state/YYYY-MM-DD-<slug>.json | After each task / worker result |
| Sprint summary updates | checkpoint JSON summary field | After each task; ≤80 lines; supersedes raw worker prose |
| Evidence report | data/run-reports/YYYY-MM-DD/<slug>.md or .json | Only when a task produces durable evidence/design/proof; exact path cited from checkpoint |
| Temp / raw worker output | /tmp/cc-*.txt, /tmp/cc-run-*.log, scratch files | Process-only; read latest direct output on wake, then distill into checkpoint; do not commit |
| Git commits | Local (push if remote exists) | After each meaningful unit, once verified |
| End-of-session summary | checkpoint JSON summary field | At session end; names final status, verified gates, next slice |

If a required artifact is missing from its expected path after a session, the session failed to document itself. Treat the session as incomplete.

Forbidden autonomous artifacts unless explicitly requested by Seva:

  • root PROGRESS.md, ROADMAP.md, HANDOFF.md, NEXT.md, or timestamped root summaries;
  • new permanent docs/*PLAN.md / docs/*STATUS.md files for progress tracking;
  • multiple competing "current" prompts/checkpoints;
  • committed raw /tmp/cc-* outputs, logs, HTML dumps, screenshots, or huge JSON blobs.

If a worker needs a report, it writes one Evidence file under data/run-reports/YYYY-MM-DD/<descriptive-slug>.md and the checkpoint summarizes it in 1–3 bullets. The next worker reads the checkpoint first, not the evidence directory.
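
Because the artifact paths are fixed by date and slug, the "did the session document itself" check can be derived instead of remembered. A sketch, assuming slugs are lowercase kebab-case (the guide does not specify a slug format); `expectedArtifactPaths` and `missingArtifacts` are hypothetical helpers.

```javascript
// Path shapes follow the artifact list above; function names and the slug format are assumptions.
function expectedArtifactPaths(date, slug) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) throw new Error(`bad date: ${date}`);
  if (!/^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(slug)) throw new Error(`bad slug: ${slug}`);
  return {
    sessionPrompt: `prompts/autonomous-${date}-${slug}.md`,
    runCheckpoint: `data/run-state/${date}-${slug}.json`,
    evidenceDir: `data/run-reports/${date}/`,
  };
}

// Returns the always-required artifacts that are missing.
// `exists` is any predicate on a path, for example fs.existsSync.
function missingArtifacts(date, slug, exists) {
  const { sessionPrompt, runCheckpoint } = expectedArtifactPaths(date, slug);
  return [sessionPrompt, runCheckpoint].filter((path) => !exists(path));
}
```

Evidence reports are left out of the required set on purpose, since they exist only when a task produces durable evidence.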


Artifact Lifecycle and Context Budget (added 2026-05-09)

Status: [TARGET] — the lifecycle classes and budget caps below are policy. Some are partially honoured today (e.g. AGENTS.md §4 already states an ~8–12k-token bootstrap budget; the Sprint Material Pack section names what to read/skip). The mechanical enforcement (a budget-check smoke, a compaction cron, automated archival) is [TARGET]. Until enforcement lands, supervisors apply these rules during launch and review.

The problem this section addresses: 10+ hour autonomous runs accumulate checkpoint sediment fast. The 2026-04-28 overnight cycle ran 42 CC tasks; the 2026-05-08 dreaming pass already noted 126→265 run reports in a few days. Without lifecycle classes and a fixed context budget, every autonomous run grows the read-list, every checkpoint cites more files, every session takes longer to bootstrap, and signal degrades. A fixed budget plus net-zero edits prevents the drift.

Five lifecycle classes

Every file the project produces belongs to exactly one class. The class determines the size cap, the cleanup policy, and whether the file is default-loaded into agent context. Autonomous workers MUST name the class for any newly created non-code artifact in their final summary; if they cannot classify it, the artifact should not exist.

| # | Class | Examples | Default-loaded? | Lifecycle |
| --- | --- | --- | --- | --- |
| 1 | Canonical | AGENTS.md, GOALS-PIPELINE.md, OPERATING-PRINCIPLES.md, ARCHITECTURE.md, docs/specs/action-catalog.md, docs/specs/hot-write-ledger.md, AUTONOMOUS-WORK.md (README.md is a one-line redirect to AGENTS.md, not a separate Canonical doc) | yes | Long-lived. Edited only with explicit reason. Updates require either Seva confirmation (AGENTS) or proposed wording in a report (others). |
| 2 | Sprint | current-sprint brief, current sprint summary appended to checkpoint, prompts/overnight-YYYY-MM-DD.md | yes during the sprint | Lifecycle = current sprint. On sprint end, durable lessons distill into Canonical; the sprint file moves to evidence archive. |
| 3 | Run-state | data/run-state/YYYY-MM-DD.json, data/run-state/YYYY-MM-DD-{task-label}.json, HEARTBEAT.md if present | yes during the active run | Lifecycle = active run. Final state archived after run completion; older checkpoints prune after the next successful run. |
| 4 | Evidence | data/run-reports/YYYY-MM-DD/<slug>.md, data/rehearsal-reports/**, data/snapshots/**, screenshots, probe outputs, dreaming reports | no (read on demand only) | Permanent retention as evidence. Must NOT be loaded into agent context by default. Read only when a current canonical doc cites the report or the current task is an audit of that incident. |
| 5 | Temp / process | /tmp/cc-*.txt, /tmp/cc-run-*.log, tmp/*.mjs, ad-hoc scratch files | no | Lifecycle = single session. Deleted at session end. The tmp/ directory is gitignored (or should be). |

A file that does not fit one of these classes is itself a defect. Either reclassify it or archive it.

Per-class size and context budgets

| Class | Size cap | Context-budget rule |
| --- | --- | --- |
| Canonical: AGENTS.md | ≤ 250 lines | Must fit fully in default agent bootstrap (per AGENTS.md §4 Context Budget Rules). The agent reads this file top-to-bottom every session. |
| Redirect: README.md | ≤ 5 lines | One-line redirect to AGENTS.md; not a Canonical doc. Does not enter the bootstrap reading list. |
| Canonical: GOALS-PIPELINE.md | ≤ 300 lines | Loaded selectively; full read fits within bootstrap budget. |
| Canonical: OPERATING-PRINCIPLES.md | ≤ 600 lines | Currently 510. Headroom narrow; future additions require compression elsewhere in the doc. |
| Canonical: ARCHITECTURE.md | ≤ 1000 lines | The cap is intentionally larger than the other Canonical docs because §11 (concept registry), §12 (test E2E), §13 (storage architecture), §14 (unified inbox), §15 (one-op-one-script), §16 (anti-amnesia) are reference material future agents look up by section, not read top-to-bottom. The 1000-line ceiling preserves the architectural-doctrine-in-one-file design while still bounding sediment. |
| Canonical: AUTONOMOUS-WORK.md | ≤ 600 lines | Currently ~440. Same compression rule. |
| Canonical: docs/specs/action-catalog.md / docs/specs/hot-write-ledger.md | ≤ 300 lines each | Loaded only on action-specific work. |
| Sprint: current-sprint brief | ≤ 60 lines | The supervisor's working context for the active sprint. Goal + DoD + stop time + safety boundary + next slice + max one paragraph each. |
| Sprint: current checkpoint summary | ≤ 80 lines | Appended to the latest checkpoint JSON summary. Names what is done, what was verified, what remains, and the next slice. Not a transcript. |
| Sprint: overnight session prompt | ≤ 120 lines | Goal, ≤10 tasks with acceptance, stopping conditions, out-of-scope, standing rules, recovery pointers. |
| Run-state: checkpoint JSON | ≤ 4 KB per file | One object: tasks_completed, tasks_remaining, cc_runs, last_completed_at, failures, decisions, artifacts_created, summary. No transcript snippets. |
| Run-state: HEARTBEAT.md | ≤ 60 lines | Stop time, current goal, last worker finished_at, next slice, stop_reason. |
| Evidence: single run report | ≤ 800 lines preferred; ≤ 2,000 lines hard cap | If a report needs to exceed 2,000 lines, split or distill. The 2026-05-08 dreaming Phase 1A report is the upper-bound reference for "long but defensible." |
| Evidence: overall directory | rotation policy [TARGET] | data/run-reports/ and data/rehearsal-reports/ need a rotation/archive policy (Phase 1B §5). Until rotation lands, do not enumerate the directory in any prompt. |
| Temp: any single file | ≤ 100 lines | Throwaway. Larger temp content goes through a script + /tmp redirect, not committed. |

These are caps, not targets. A canonical doc at 400 lines is fine; one at 595 is on the edge and the next edit must compress before adding.
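
The checkpoint caps are concrete enough to check mechanically. A minimal sketch of such a check, assuming "4 KB" means 4096 bytes of compact JSON and that `summary` is a newline-separated string; `checkpointProblems` is a hypothetical helper, not the [TARGET] budget-check smoke.

```javascript
// Field list is the checkpoint schema named in the budget table.
const CHECKPOINT_FIELDS = [
  "tasks_completed", "tasks_remaining", "cc_runs", "last_completed_at",
  "failures", "decisions", "artifacts_created", "summary",
];
const MAX_CHECKPOINT_BYTES = 4 * 1024; // assumption: "4 KB" read as 4096 bytes
const MAX_SUMMARY_LINES = 80;

// Hypothetical validator: returns a list of problems; an empty list means the caps are met.
function checkpointProblems(checkpoint) {
  const problems = [];
  for (const field of CHECKPOINT_FIELDS) {
    if (!(field in checkpoint)) problems.push(`missing field: ${field}`);
  }
  // Size is measured on compact JSON; a pretty-printed file on disk would be larger.
  const bytes = new TextEncoder().encode(JSON.stringify(checkpoint)).length;
  if (bytes > MAX_CHECKPOINT_BYTES) problems.push(`too large: ${bytes} bytes`);
  const summaryLines = String(checkpoint.summary ?? "").split("\n").length;
  if (summaryLines > MAX_SUMMARY_LINES) problems.push(`summary too long: ${summaryLines} lines`);
  return problems;
}
```

Returning a list instead of throwing lets a supervisor cite every violation in one report.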

Net-zero context budget rule

Status: [TARGET] for mechanical enforcement; [ENFORCED] going forward as a supervisor judgment call.

The total bootstrap context budget for the default agent reading list is fixed at the AGENTS.md §4 envelope (~8–12k tokens). To add content to a Canonical or Sprint file inside that envelope, the agent (or the supervisor approving the edit) MUST first compress or remove an equivalent number of lines/tokens elsewhere in the same file or in the bootstrap reading list. Specifically:

  1. Edits that grow a Canonical doc by N lines require at least N lines of compression in the same doc OR an explicit deletion plan for redundant content elsewhere in the bootstrap set, named in the edit's report.
  2. New Sprint files do not extend the bootstrap reading list. They REPLACE the prior sprint's brief/checkpoint summary. The prior sprint's brief moves to evidence archive in the same operation.
  3. Adding a new Canonical doc requires retiring or merging an existing Canonical doc. No silent expansion of the canonical set. The removed HANDOFF.md experiment in 2026-05-08 is the cautionary case: it grew the canonical set, was not retired explicitly, and required two separate dreaming phases to remove the routing.
  4. The Sprint Material Pack rules (above) are the read-side complement. Net-zero applies to the canonical/sprint write side; the read-side rule says do not load evidence/temp/process files unless the current decision needs them.
  5. Forbidden phrasing in edit reports: "while we're at it, also added X" / "expanded with helpful detail" / "a small note for clarity" — without a paired compression. These phrases are how canonical docs grow to 40 KB. The corrective phrasing is "added X (Y lines); compressed Z (Y lines) to maintain the budget."

This rule is the structural reason the Phase 4B status legend exists: when a contract grows from a sentence to a paragraph, the budget gets paid in concrete debt elsewhere; the legend keeps the trade visible.
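
Rule 1 of the net-zero list reduces to a simple comparison that a reviewer can apply to any edit report. A sketch, assuming the report states line counts; `isNetZero` and its field names are hypothetical.

```javascript
// Hypothetical check for an edit report against net-zero rule 1.
// Assumed fields: `linesAdded` and `linesCompressed` are counted in the same doc;
// `deletionPlan` names redundant content elsewhere in the bootstrap set.
function isNetZero({ linesAdded, linesCompressed, deletionPlan = "" }) {
  if (linesAdded <= 0) return true; // neutral or shrinking edits always pass
  if (linesCompressed >= linesAdded) return true; // growth paid for in the same doc
  return deletionPlan.trim().length > 0; // otherwise an explicit plan must be named
}
```

The function only verifies that a deletion plan is named; whether the plan is adequate remains a supervisor judgment call.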

Periodic compaction cadence

Status: [TARGET] — supervisor process rule. No cron exists yet. Until one does, the supervisor schedules these phases manually for runs longer than 4 hours.

For autonomous runs longer than ~4 wall-clock hours OR more than ~10 CC runs, the supervisor inserts compaction phases at the cadences below. Compaction is not optional; it is part of the run, not a follow-up.

| Run length | Compaction cadence | What gets compacted |
| --- | --- | --- |
| ≤ 4 h, ≤ 10 CC runs | none required | rely on per-task checkpoint discipline |
| 4–8 h, 10–20 CC runs | one mid-run compaction at the ~halfway point | sprint brief: trim done-and-verified tasks to a one-line ledger; checkpoint: collapse tasks_completed array into a count + last-3 detail; evidence: confirm new reports are catalogued in the supervisor summary, not pasted |
| 8–16 h, 20–40 CC runs | every ~4 h | as above, plus prune any temp files older than the compaction window; verify the bootstrap reading list still fits the budget; rewrite stale checkpoint-summary sentences |
| > 16 h, > 40 CC runs | every ~3 h | as above, plus a fresh-context check: launch a verifier session with no prior context, ask it to recover the current sprint state from the canonical+sprint files only, and confirm it can. If it cannot, the sprint files are stale and the run pauses for repair before further work |

Compaction phases produce one new checkpoint that supersedes the prior one. The prior checkpoint moves to evidence (data/run-state/archive/); it is not deleted. The next CC run reads the new checkpoint; it does not need the older ones.
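
Until a cron exists, the cadence table can be applied with a small lookup at launch time. A sketch under two assumptions the table leaves open: when hours and CC runs fall in different rows the stricter row wins, and boundary values belong to the lower row. `compactionCadence` and its tier labels are hypothetical.

```javascript
// Hypothetical lookup for the cadence table.
// Assumptions: the stricter row wins when the two dimensions disagree,
// and boundary values (4 h, 10 runs, ...) belong to the lower row.
function compactionCadence(hours, ccRuns) {
  if (hours <= 4 && ccRuns <= 10) return { tier: "none", everyHours: null };
  if (hours <= 8 && ccRuns <= 20) return { tier: "mid-run", everyHours: null };
  if (hours <= 16 && ccRuns <= 40) return { tier: "periodic", everyHours: 4 };
  return { tier: "periodic-with-fresh-context-check", everyHours: 3 };
}
```

For example, a 3-hour run that has already used 25 CC runs lands in the every-4-hours row, because the run count alone pushes it past the first two rows.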

Cleanup policy per class

Status: mostly [TARGET] — there is no scheduled cleanup yet. The 2026-05-08 inventory (Phase 1B) catalogued the sediment.

| Class | Cleanup rule | Cleanup trigger |
| --- | --- | --- |
| Canonical | Edits only via report-proposed wording. Retirement requires a Phase 2-style canonical-doc patch that explicitly moves the doc to archive. | Manual; never automatic. |
| Sprint | On sprint end, distill durable lessons into Canonical, then move the sprint files to docs/archive/sprints/<sprint-id>/. The active sprint folder always holds at most one sprint's worth of files. | End of sprint; supervisor explicit signal. |
| Run-state | Active checkpoint stays at data/run-state/<date>.json. After successful run completion, move to data/run-state/archive/<YYYY-MM>/. Older checkpoints already in archive/ are kept for 30 days then tar+gzip. | Cron-driven [TARGET]; or on next launch, the supervisor archives the prior checkpoint as part of the new launch. |
| Evidence | Never deleted (it's evidence). Rotation: keep last 14 days at data/run-reports/, older months in data/run-reports/archive/<YYYY-MM>.tar.gz. Same for data/rehearsal-reports/ (last 7 days unrotated). | Scheduled rotation [TARGET]. |
| Temp | Deleted at session end. tmp/ is gitignored (verify with each session). | End of session; or session start (clear leftovers from prior crashes). |

If a file does not have a cleanup trigger, it does not have a class. Reclassify or remove it.

Why this section exists

10+ hour autonomous runs are not a matter of how good any single CC session is; they are a matter of whether the artifact stack stays small enough to remain readable. Every autonomous run that ended in confusion since 2026-04-27 had at least one of: an empty heartbeat (no fixed Sprint state), a sprint brief that grew past 200 lines (no net-zero rule), a directory of ~30 reports the next session did not read (no evidence default-skip), or a temp file left over from a prior crash (no temp cleanup). This section is the antidote to all four.

A long autonomous run with this discipline looks like: one canonical set under fixed caps; one sprint folder with one brief, one sprint summary, one checkpoint; one evidence directory growing append-only and sweep-archived; an empty tmp/ between sessions. A long autonomous run without it looks like 47 checkpoint summaries, 265 reports, three "current" briefs disagreeing on the goal, and a context budget that says yes to everything.

Pick the first.


Eval Gates

Before marking any task done, at least one of these must pass:

| Gate Type | When to Use | Example |
| --- | --- | --- |
| Script exit code | Deterministic scripts | node scripts/check-ledger-consistency.mjs exits 0 |
| File existence + content check | Data pipeline outputs | data/daily-signals-YYYY-MM-DD.json exists, has >0 scored entries |
| Grep/pattern match | Document updates | grep -q "status: done" in relevant YAML |
| Dry-run comparison | Write operations | Dry-run output matches expected shape |
| Build success | Site/deploy changes | npx vercel build --prod exits 0 |
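
The "at least one gate must pass" rule can be expressed as a replayable function. A sketch under assumptions: `evaluateGates` is a hypothetical runner, and `hasScoredEntries` assumes a file shape (an `entries` array with numeric `score` fields) that this guide does not specify.

```javascript
// Hypothetical gate runner. Each gate is { name, check } where check() returns true on pass.
// A thrown error counts as a failed gate, so a broken check can never mark a task done.
function evaluateGates(gates) {
  const results = gates.map(({ name, check }) => {
    try {
      return { name, passed: check() === true };
    } catch (error) {
      return { name, passed: false, error: String(error) };
    }
  });
  return { done: results.some((r) => r.passed), results };
}

// Example content gate. The `entries` array and `score` field are an assumed file shape.
function hasScoredEntries(jsonText) {
  const entries = JSON.parse(jsonText).entries;
  return Array.isArray(entries) && entries.some((e) => typeof e.score === "number");
}
```

Writing the per-gate results into the checkpoint keeps the verdict reproducible for the next session.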

"I reviewed the output and it looks correct" is not an eval gate. It's a human judgment that can't be replayed.


Parallel Runs

Multiple CC tasks may run simultaneously. Rules:

  1. No shared mutable state. Parallel tasks must not write to the same files. If two tasks need to update the same checkpoint JSON or canonical doc, they run sequentially.
  2. Separate output paths. Each parallel task writes to its own output directory or file prefix.
  3. Don't kill siblings. A task must never terminate another running task. Check ps aux | grep cc- before any process management.
  4. Checkpoint independently. Each parallel task maintains its own checkpoint in data/run-state/YYYY-MM-DD-{task-label}.json.
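
Rules 1 and 2 can be checked before launch if each task declares what it will write. A sketch, assuming a hypothetical declaration format of `{ label, writes }` per task; no such convention exists in the project today.

```javascript
// Hypothetical pre-launch check. Each task declares the paths it will write (`writes`);
// this declaration format is an assumption for illustration.
function findWriteConflicts(tasks) {
  const owners = new Map(); // path -> labels of the tasks that write it
  for (const { label, writes } of tasks) {
    for (const path of writes) {
      owners.set(path, [...(owners.get(path) ?? []), label]);
    }
  }
  // Any path with more than one writer is shared mutable state.
  return [...owners]
    .filter(([, labels]) => labels.length > 1)
    .map(([path, labels]) => ({ path, labels }));
}
```

Tasks named in a conflict run sequentially; everything else can run in parallel.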

Recovery After Crashes

If a session crashes or is interrupted:

  1. Read the last checkpoint in data/run-state/. It shows what completed and what remains.
  2. Check git log for commits made during the session.
  3. Don't re-run completed tasks unless verification fails. Tasks are designed to be idempotent, but unnecessary re-runs waste tokens.
  4. Resume from the next uncompleted task in the original session prompt.
  5. If no checkpoint exists, treat the session as having reliably completed zero tasks. Start from the beginning of the task list.
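
Steps 1, 4, and 5 amount to a lookup against the checkpoint. A minimal sketch, assuming `tasks_completed` holds task IDs in the same form the session prompt uses; `nextTaskToRun` is a hypothetical helper.

```javascript
// Hypothetical resume helper. `promptTaskIds` is the ordered task list from the session
// prompt; `checkpoint` is the parsed checkpoint object, or null when none exists.
function nextTaskToRun(promptTaskIds, checkpoint) {
  const completed = new Set(checkpoint ? checkpoint.tasks_completed : []);
  // First task in prompt order that the checkpoint does not record as completed.
  return promptTaskIds.find((id) => !completed.has(id)) ?? null;
}
```

A `null` result means every listed task is recorded as complete. Tasks recorded under `failures` are not skipped by this sketch and would need separate handling.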

Watchdog Patterns

For sessions longer than 2 hours or 10+ CC runs:

  • Gutter detection (from Ralph loop pattern): same command failed 3+ times, or no progress on checklist items across 3 consecutive runs → stop and document.
  • Token budget tracking: if context usage exceeds 60%, force a checkpoint write. At 80%, consider stopping the session.
  • Staleness check: if the last checkpoint is >30 minutes old and no new commits exist, something may be stuck.
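
The three watchdog checks can be combined into one function that a supervisor calls between runs. A sketch: the thresholds come from the list above, while `watchdogSignals`, its input shape, and the signal labels are hypothetical.

```javascript
// Hypothetical watchdog. Thresholds come from the guide; the input shape is an assumption.
function watchdogSignals({ sameCommandFailures, runsWithoutProgress, contextUsedFraction,
  minutesSinceCheckpoint, newCommitsSinceCheckpoint }) {
  const signals = [];
  // Gutter detection: repeated failure of one command, or three runs with no checklist progress.
  if (sameCommandFailures >= 3 || runsWithoutProgress >= 3) signals.push("gutter:stop_and_document");
  // Token budget: checkpoint above 60% context usage, consider stopping at 80%.
  if (contextUsedFraction >= 0.8) signals.push("context:consider_stopping");
  else if (contextUsedFraction > 0.6) signals.push("context:force_checkpoint");
  // Staleness: an old checkpoint with no new commits suggests a stuck session.
  if (minutesSinceCheckpoint > 30 && newCommitsSinceCheckpoint === 0) signals.push("stale:investigate");
  return signals;
}
```

An empty array means no watchdog condition is active.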

What This Guide Does NOT Cover

  • How to write the master script (run-daily-cycle.mjs) — see OPERATING-PRINCIPLES.md §4 Master Script Contract.
  • How to execute X actions — see docs/reference/browser-execution/BROWSER-QUEUE-RUNBOOK.md and docs/specs/action-catalog.md.
  • Task-specific acceptance criteria — supplied by the launch prompt's sprint brief and the per-task DoD; not centralized in a canonical doc.
  • Vercel deployment — see publishing-repo docs. For x-archive docs rendered on site, use local prebuilt deploy (npx vercel build --prod + npx vercel deploy --prebuilt --prod --yes) because remote Vercel cannot see local x-archive.

Sources

Local Evidence

  • Overnight Postmortem 2026-04-28 (docs/archive/OVERNIGHT-POSTMORTEM-2026-04-28.md): 42-run analysis
  • Memory 2026-04-28 (~/.openclaw/workspace/memory/2026-04-28.md): session log
  • 71 CC run logs under /tmp/cc-*20260428*: task outputs
  • 12 git commits 2026-04-27 → 2026-04-28: verified changes

External References

  • Anthropic, "Effective Harnesses for Long-Running Agents" (2025-11): two-agent init/code pattern, progress files, failed-approach tracking
  • Anthropic, "Claude Code Auto Mode" (2026-03): 17% false-negative rate, scope escalation incidents, three-tier allowlist
  • Anthropic, "Long-Running Claude for Scientific Computing" (2026-03): CLAUDE.md as plan, CHANGELOG.md as lab notebook, git-commit-per-unit, reference implementations as test oracles
  • Stanford, "SWE-chat: Coding Agent Interactions From Real Users" (2026-04): 44% code survival rate, 9x security vulnerability rate in full-autonomy mode
  • Knightli, "Ralph: Autonomous Agent Loop for Claude Code" (2026-04): context rotation, file-based state, token tracking zones, gutter detection, 20-iteration cap
  • Sattyam Jain, "The Agent That Burned $4,200 in 63 Hours" (2026-04): budget guards on four dimensions
  • Vectara, "Awesome Agent Failures" (GitHub): curated catalog of production agent incidents
  • DEV Community, "HITL for AI Agents: Patterns and Best Practices" (2025-04): composite confidence scoring, 30-minute timeout, reviewer fatigue data
  • Sakura Sky, "Trustworthy AI Agents: Deterministic Replay" (2025-11): JSONL event sourcing, replay clients, golden file testing

Daily Run Reliability Campaign

After the component-level DoD for original posts and mafia engagement is closed, the next phase is to debug the full daily run itself. Run the scoped script-first daily cycle end-to-end at least 10 counted times. Each counted run must include structured results plus analysis. If a run fails, flakes, or exposes weak evidence, fix/improve the system and rerun. Do not treat blind repetition as evidence. The campaign is complete only when 10 analyzed runs pass reliably against the current implementation, with no live X writes and all operator gates explicit.
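
The completion rule can be tallied from structured run records. A sketch under assumptions: the record shape `{ passed, analyzed, liveWrites, implementation }` is hypothetical, and `implementation` stands for whatever identifies the code version (for example a commit hash), so that runs made before a fix stop counting.

```javascript
// Hypothetical campaign tally. Assumed run record: { passed, analyzed, liveWrites, implementation }.
// Only analyzed, passing runs with no live X writes against the current implementation count.
function campaignStatus(runs, currentImplementation, required = 10) {
  const counted = runs.filter((run) =>
    run.implementation === currentImplementation &&
    run.analyzed === true &&
    run.passed === true &&
    run.liveWrites === 0);
  return { counted: counted.length, complete: counted.length >= required };
}
```

Because every fix changes the implementation identifier, a failure followed by a fix resets the count without extra bookkeeping.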

Child Agent Intention Chain

When delegating development/debugging/reliability work to Claude Code or any child agent, pass the full chain of intentions and goals, not just the narrow implementation step. The child agent should understand the product goal, Seva's risk/time posture, prior decisions, current DoD, safety boundaries, out-of-scope areas, and how this slice feeds the next phase. This lets capable agents make better local choices and catch mismatches that a ticket-sized prompt would hide.

Do not include irrelevant transcript noise, secrets, or unrelated projects. The rule is not “dump everything”; it is “preserve the intention chain required to reason like an owner.”