docs/runbooks.md

Metadata

  • Purpose: Project documentation source file.
  • Domain: documentation
  • Language: md
  • Bytes: 5432
  • Lines: 121
  • Content hash (short): a352274d
  • Source (start): docs/runbooks.md:1
  • Source (end): docs/runbooks.md:121

Indexed Symbols

No indexed functions/methods detected in this file.

Markdown Headings (if applicable)

  • H1: Operations Runbooks (line 1)
  • H2: Incident: telemetry ingest outage (line 3)
  • H2: Incident: workflow backlog growth (line 10)
  • H2: Acceptance runbook: agent reasoner smoke (line 17)
  • H2: Incident: outbox dispatcher stalled (line 26)
  • H2: Incident: failed import/export pipeline jobs (line 34)
  • H2: Incident: production rollback (line 41)
  • H3: Rollback drill checklist (required before release) (line 48)
  • H2: Data onboarding: first live tenant (line 61)
  • H2: Daily MSP operations runbook (UI-first) (line 82)
  • H2: First-client onboarding runbook (UI-first) (line 97)
  • H2: Capacity: 4GB node saturation (line 114)

Source Preview

# Operations Runbooks

## Incident: telemetry ingest outage

1. Verify gateway health endpoint and deployment status.
2. Inspect edge agent buffer growth rate on impacted endpoints.
3. Validate mTLS certificate chain and token validity.
4. Drain buffered events after gateway recovery.
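
For step 2, a rough way to judge urgency is to compare two buffer readings and estimate how long the agent has before its buffer fills. The sketch below is illustrative only: `BufferSample`, `growth_rate`, `seconds_until_full`, and the capacity figure are hypothetical names and values, not part of the edge agent's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferSample:
    """One reading of an edge agent's local buffer (hypothetical shape)."""
    timestamp_s: float      # when the reading was taken, in seconds
    buffered_events: int    # events waiting to be sent to the gateway


def growth_rate(first: BufferSample, second: BufferSample) -> float:
    """Return buffer growth in events per second between two readings."""
    elapsed = second.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return (second.buffered_events - first.buffered_events) / elapsed


def seconds_until_full(latest: BufferSample, rate: float, capacity: int) -> Optional[float]:
    """Estimate time until the buffer reaches capacity.

    Returns None when the buffer is flat or shrinking, since it will not fill.
    """
    if rate <= 0:
        return None
    remaining = max(capacity - latest.buffered_events, 0)
    return remaining / rate


if __name__ == "__main__":
    earlier = BufferSample(timestamp_s=0.0, buffered_events=1000)
    later = BufferSample(timestamp_s=60.0, buffered_events=4000)
    rate = growth_rate(earlier, later)
    # ASSUMPTION: a 10,000-event capacity, chosen only for illustration.
    print(rate, seconds_until_full(later, rate, capacity=10_000))
```

Endpoints with the shortest estimated time to full are the ones most at risk of dropping events, so they are reasonable candidates to drain first in step 4.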

## Incident: workflow backlog growth

1. Inspect AgentField queue depth and execution latency.
2. Scale reasoner worker pools and examine failure buckets.
3. Replay dead-letter events after root-cause mitigation.
4. Confirm SLA and billing side effects are complete.
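
When examining failure buckets in step 2, it can help to rank them by volume so the dominant failure mode is addressed before any dead-letter replay. The following is a minimal sketch, assuming failed executions can be exported as records with an `error_class` field; that field name and the `failure_buckets` helper are hypothetical and not confirmed by the AgentField API.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def failure_buckets(failures: Iterable[Mapping[str, str]]) -> List[Tuple[str, int]]:
    """Group failed executions by error class, largest bucket first.

    ASSUMPTION: each record carries an "error_class" key; records without
    one are counted under "unknown".
    """
    counts = Counter(record.get("error_class", "unknown") for record in failures)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        {"id": "a", "error_class": "timeout"},
        {"id": "b", "error_class": "auth"},
        {"id": "c", "error_class": "timeout"},
        {"id": "d"},
    ]
    for bucket, count in failure_buckets(sample):
        print(f"{bucket}: {count}")
```

Replaying dead-letter events while the largest bucket is still unmitigated would likely send them straight back to the dead-letter queue, which is why step 3 follows root-cause mitigation.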

## Acceptance runbook: agent reasoner smoke

1. Create or select a workspace with valid JWT access.
2. Execute `POST /api/v1/agent-runtime/smoke` (single agent or all launch agents).
3. Verify execution appears in `GET /api/v1/workflow-executions`.
4. Verify callback persistence in `GET /api/v1/workflow-executions/{id}`.
5. Confirm an audit credential exists in `GET /api/v1/audit-credentials`.
6. If the smoke run failed, use the retry/cancel controls and re-check the runtime failure buckets.
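
Steps 2 through 5 can be chained into a single check. The sketch below uses only the endpoint paths listed in this runbook; the response shapes (`execution_id`, `id`, `callbacks`) and the injected `call` transport are assumptions made for illustration, so adjust them to the actual payloads before relying on it.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: (method, path, json_body) -> parsed JSON response.
# In practice this would wrap an HTTP client that attaches the workspace JWT.
Transport = Callable[[str, str, Any], Any]


def run_smoke_check(call: Transport) -> Dict[str, bool]:
    """Walk runbook steps 2-5 and report which verifications passed."""
    # ASSUMPTION: the smoke endpoint returns {"execution_id": "..."}.
    started = call("POST", "/api/v1/agent-runtime/smoke", {})
    execution_id = started["execution_id"]

    # ASSUMPTION: the list endpoint returns a JSON array of {"id": ...} objects.
    executions = call("GET", "/api/v1/workflow-executions", None)
    listed = any(item.get("id") == execution_id for item in executions)

    # ASSUMPTION: persisted callbacks appear under a "callbacks" key.
    detail = call("GET", f"/api/v1/workflow-executions/{execution_id}", None)
    callback_persisted = bool(detail.get("callbacks"))

    # ASSUMPTION: audit credentials reference the run via "execution_id".
    credentials = call("GET", "/api/v1/audit-credentials", None)
    credential_found = any(
        item.get("execution_id") == execution_id for item in credentials
    )

    return {
        "execution_listed": listed,
        "callback_persisted": callback_persisted,
        "audit_credential_found": credential_found,
    }
```

Passing the transport in as a parameter keeps the check independent of any particular HTTP library and lets it be exercised against a stub. Any `False` value in the result corresponds to the failure path in step 6.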