docs/runbooks.md

Operations Runbooks

Incident: telemetry ingest outage

  1. Verify gateway health endpoint and deployment status.
  2. Inspect edge agent buffer growth rate on impacted endpoints.
  3. Validate mTLS certificate chain and token validity.
  4. Drain buffered events after gateway recovery.

Incident: workflow backlog growth

  1. Inspect AgentField queue depth and execution latency.
  2. Scale reasoner worker pools and examine failure buckets.
  3. Replay dead-letter events after root-cause mitigation.
  4. Confirm SLA and billing side effects are complete.

Acceptance runbook: agent reasoner smoke

  1. Create or select a workspace with valid JWT access.
  2. Execute POST /api/v1/agent-runtime/smoke (single agent or all launch agents).
  3. Verify execution appears in GET /api/v1/workflow-executions.
  4. Verify callback persistence in GET /api/v1/workflow-executions/{id}.
  5. Confirm audit credential exists in GET /api/v1/audit-credentials.
  6. If the smoke execution fails, use the retry/cancel controls and re-check the runtime failure buckets.

Incident: outbox dispatcher stalled

  1. Check anchor-outbox-worker pod health and logs.
  2. Inspect outbox_events for sustained pending/processing growth.
  3. Verify AgentField connectivity and authentication from worker.
  4. Restart worker deployment and monitor delivery status transitions.
  5. Requeue any dead-letter events with POST /api/v1/agent-runtime/outbox-failures/{id}/retry.

Incident: failed import/export pipeline jobs

  1. Inspect failed jobs in GET /api/v1/agent-runtime/failures (failedJobs list).
  2. Validate adapter credentials and object-store targets in admin settings.
  3. Requeue failed jobs using POST /api/v1/agent-runtime/job-failures/{id}/retry.
  4. Confirm queued/running/succeeded progression in GET /api/v1/jobs.

Incident: production rollback

  1. Revert infra/helm/platform/environments/prod/values.yaml to prior image tag.
  2. Merge rollback PR and validate Argo CD sync health.
  3. Run smoke checks on /health, the execute route, and the ingest route.
  4. Export rollback evidence and attach to incident record.

Rollback drill checklist (required before release)

  1. Capture current prod image tags from environments/prod/release.yaml.
  2. Create an ops PR that pins prod to the previous known-good tag.
  3. Merge PR and run Argo hard refresh + prune sync.
  4. Verify anchor-prod is Synced and Healthy.
  5. Verify:
    • GET https://ops.anchor-msp.com
    • GET https://ops.anchor-msp.com/api/v1/health
    • GET https://ops.anchor-msp.com/api/v1/openapi
  6. Create forward-fix PR restoring intended release tag.
  7. Merge and repeat health checks.
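
The verification in steps 5 and 7 can be scripted so that each run leaves an evidence line per URL. The following is an added example, not part of the original runbook or the project's scripts. It assumes HTTP 200 is the pass condition (the runbook lists only the URLs); the HTTP_STATUS_CMD hook and the evidence file format are hypothetical.

```sh
#!/bin/sh
# Added example (not from the runbook's repo): drill smoke check with evidence log.
# Assumption: HTTP 200 is the pass condition; the runbook lists only the URLs.
BASE_URL="${BASE_URL:-https://ops.anchor-msp.com}"

# http_status URL -> prints the status code, or 000 on DNS/TLS/timeout failure.
# TLS verification is left on (no -k) so a bad certificate fails the drill.
http_status() {
  curl --silent --output /dev/null --max-time 10 --proto '=https' \
       --write-out '%{http_code}' "$1" 2>/dev/null || true
}

# smoke_check EVIDENCE_FILE -> 0 only if every drill URL returned 200.
# HTTP_STATUS_CMD (hypothetical hook) swaps the fetcher, e.g. for a dry run.
smoke_check() {
  evidence="$1"
  failed=0
  for path in "" /api/v1/health /api/v1/openapi; do
    url="${BASE_URL}${path}"
    code="$("${HTTP_STATUS_CMD:-http_status}" "$url")"
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    if [ "$code" = "200" ]; then
      result=PASS
    else
      result=FAIL
      failed=1
    fi
    # One line per check: timestamp, verdict, status, URL (append-only evidence).
    printf '%s %s %s %s\n' "$ts" "$result" "$code" "$url" >> "$evidence"
  done
  return "$failed"
}
```

Call `smoke_check rollback-evidence.log` after the rollback merge and again after the forward-fix merge; a non-zero return means at least one URL did not answer 200, and the log can be attached to the incident record.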

Data onboarding: first live tenant

  1. Apply migrations (scripts/apply-migrations.sh).
  2. Run scripts/bootstrap/onboard-workspace.sh with real tenant inputs:
    • WORKSPACE_NAME
    • WORKSPACE_SLUG
    • CLIENT_NAME
    • CONTACT_FIRST_NAME
    • CONTACT_LAST_NAME
    • CONTACT_EMAIL
    • ASSET_HOSTNAME
    • ASSET_OS
    • TICKET_TITLE
  3. Verify API summaries return 200 for the new workspace token/ID.
  4. Verify operator UI shows workspace switcher and non-empty counters.
  5. Verify portal UI loads ticket/approval/invoice sections without placeholder cards.
  6. Generate a portal invite from /admin/settings and validate:
    • invite URL opens the portal with workspace context
    • token grants client_portal only
    • revocation immediately blocks access.

Daily MSP operations runbook (UI-first)

  1. Open Console and select workspace.
  2. Review module queues in order:
    • Tickets for new/in-progress work.
    • Alerts for open/critical alerts.
    • Invoices for overdue/payment follow-up.
  3. Use lifecycle controls (status transition + archive) directly from module list rows.
  4. Open Ops for workflow failures and runtime health:
    • resolve critical alert queue first
    • retry failed workflows/outbox/jobs with reason capture
    • validate AgentField worker heartbeat + queue depth.
  5. Use Portal invite session to spot-check client-scoped UX for one tenant daily.
  6. Use Admin Settings to confirm telemetry ingest and adapter connectivity statuses remain healthy.

First-client onboarding runbook (UI-first)

  1. In Console create baseline records:
    • Clients
    • Contacts
    • Assets
    • required Services.
  2. In Admin Settings, rotate endpoint enrollment token and hand off enrollment bundle to endpoint deployment staff.
  3. Verify telemetry ingest appears in Admin/Ops views (batches24h, events24h, and no schema rejection spike).
  4. Generate portal invite in Admin Settings, share invite URL, and validate client session handoff.
  5. In Portal, verify the client can:
    • submit/update ticket
    • approve/reject quote
    • acknowledge invoice/payment intent
    • browse KB.
  6. In Ops, trigger and verify one smoke AgentField execution for the tenant to validate callback->workflow->audit chain.

Capacity: 4GB node saturation

  1. Verify sustained CPU or memory pressure from kubectl top pods and pod scheduling failures.
  2. Scale non-prod namespaces (anchor-dev, anchor-staging) to zero before changing prod.
  3. Keep web and gateway at replicaCount=1 with maxSurge=0/maxUnavailable=1 on single-node profile.
  4. Move stateful dependencies to managed services if still in-cluster.
  5. If saturation persists, raise a capacity incident and bump node class before enabling extra replicas.