Imported Content
Operations Runbooks
Incident: telemetry ingest outage
- Verify gateway health endpoint and deployment status.
- Inspect edge agent buffer growth rate on impacted endpoints.
- Validate mTLS certificate chain and token validity.
- Drain buffered events after gateway recovery.
Incident: workflow backlog growth
- Inspect AgentField queue depth and execution latency.
- Scale reasoner worker pools and examine failure buckets.
- Replay dead-letter events after root-cause mitigation.
- Confirm SLA and billing side effects are complete.
Acceptance runbook: agent reasoner smoke
- Create or select a workspace with valid JWT access.
- Execute
POST /api/v1/agent-runtime/smoke (single agent or all launch agents).
- Verify execution appears in
GET /api/v1/workflow-executions.
- Verify callback persistence in
GET /api/v1/workflow-executions/{id}.
- Confirm audit credential exists in
GET /api/v1/audit-credentials.
- If failed, execute retry/cancel controls and re-check runtime failure buckets.
Incident: outbox dispatcher stalled
- Check
anchor-outbox-worker pod health and logs.
- Inspect
outbox_events for sustained pending/processing growth.
- Verify AgentField connectivity and authentication from worker.
- Restart worker deployment and monitor delivery status transitions.
- Requeue any dead-letter events from
POST /api/v1/agent-runtime/outbox-failures/{id}/retry.
Incident: failed import/export pipeline jobs
- Inspect failed jobs in
GET /api/v1/agent-runtime/failures (failedJobs list).
- Validate adapter credentials and object-store targets in admin settings.
- Requeue failed jobs using
POST /api/v1/agent-runtime/job-failures/{id}/retry.
- Confirm queued/running/succeeded progression in
GET /api/v1/jobs.
Incident: production rollback
- Revert
infra/helm/platform/environments/prod/values.yaml to prior image tag.
- Merge rollback PR and validate Argo CD sync health.
- Run smoke checks on
/health, execute route, and ingest route.
- Export rollback evidence and attach to incident record.
Rollback drill checklist (required before release)
- Capture current prod image tags from
environments/prod/release.yaml.
- Create an ops PR that pins prod to the previous known-good tag.
- Merge PR and run Argo hard refresh + prune sync.
- Verify
anchor-prod is Synced and Healthy.
- Verify:
GET https://ops.anchor-msp.com
GET https://ops.anchor-msp.com/api/v1/health
GET https://ops.anchor-msp.com/api/v1/openapi
- Create forward-fix PR restoring intended release tag.
- Merge and repeat health checks.
Data onboarding: first live tenant
- Apply migrations (
scripts/apply-migrations.sh).
- Run
scripts/bootstrap/onboard-workspace.sh with real tenant inputs:
WORKSPACE_NAME
WORKSPACE_SLUG
CLIENT_NAME
CONTACT_FIRST_NAME
CONTACT_LAST_NAME
CONTACT_EMAIL
ASSET_HOSTNAME
ASSET_OS
TICKET_TITLE
- Verify API summaries return
200 for the new workspace token/ID.
- Verify operator UI shows workspace switcher and non-empty counters.
- Verify portal UI loads ticket/approval/invoice sections without placeholder cards.
- Generate a portal invite from
/admin/settings and validate:
- invite URL opens the portal with workspace context
- token grants
client_portal only
- revocation immediately blocks access.
Daily MSP operations runbook (UI-first)
- Open
Console and select workspace.
- Review module queues in order:
Tickets for new/in-progress work.
Alerts for open/critical alerts.
Invoices for overdue/payment follow-up.
- Use lifecycle controls (status transition + archive) directly from module list rows.
- Open
Ops for workflow failures and runtime health:
- resolve critical alert queue first
- retry failed workflows/outbox/jobs with reason capture
- validate AgentField worker heartbeat + queue depth.
- Use
Portal invite session to spot-check client-scoped UX for one tenant daily.
- Use
Admin Settings to confirm telemetry ingest and adapter connectivity statuses remain healthy.
First-client onboarding runbook (UI-first)
- In
Console create baseline records:
Clients
Contacts
Assets
- required
Services.
- In
Admin Settings, rotate endpoint enrollment token and hand off enrollment bundle to endpoint deployment staff.
- Verify telemetry ingest appears in Admin/Ops views (
batches24h, events24h, and no schema rejection spike).
- Generate portal invite in
Admin Settings, share invite URL, and validate client session handoff.
- In
Portal, verify the client can:
- submit/update ticket
- approve/reject quote
- acknowledge invoice/payment intent
- browse KB.
- In
Ops, trigger and verify one smoke AgentField execution for the tenant to validate callback->workflow->audit chain.
Capacity: 4GB node saturation
- Verify sustained CPU or memory pressure from
kubectl top pods and pod scheduling failures.
- Scale non-prod namespaces (
anchor-dev, anchor-staging) to zero before changing prod.
- Keep
web and gateway at replicaCount=1 with maxSurge=0/maxUnavailable=1 on single-node profile.
- Move stateful dependencies to managed services if still in-cluster.
- If saturation persists, raise a capacity incident and bump node class before enabling extra replicas.