docs/telemetry.md

Telemetry and Endpoint Automation

Event schema types

  • HostHeartbeat
  • PatchCompliance
  • BackupFailure
  • ServiceDegraded
  • PipelineFailure
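
The schema types above can be modeled as a closed set so that unknown types are rejected early. The following is a minimal sketch, assuming the wire value of each type equals its name as listed; `EventType` and `parse_event_type` are hypothetical names, not part of the documented API.

```python
from enum import Enum


class EventType(str, Enum):
    """Event schema types from the list above (wire values are assumed)."""

    HOST_HEARTBEAT = "HostHeartbeat"
    PATCH_COMPLIANCE = "PatchCompliance"
    BACKUP_FAILURE = "BackupFailure"
    SERVICE_DEGRADED = "ServiceDegraded"
    PIPELINE_FAILURE = "PipelineFailure"


def parse_event_type(raw: str) -> EventType:
    """Return the matching EventType, or raise ValueError for an unknown type."""
    try:
        return EventType(raw)
    except ValueError:
        raise ValueError(f"unknown event type: {raw!r}") from None
```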

Ingest path

  1. Edge agent buffers telemetry locally.
  2. The agent flushes the buffered payload to POST /api/v1/events/ingest over HTTPS, sending these headers:
    • Authorization: Bearer <edge enrollment token>
    • x-workspace-id: <workspace uuid>
    • Idempotency-Key: <uuid>
  3. Gateway validates JWT role/scope, idempotency key, and event schema contract.
  4. Routed actions are persisted to outbox_events.
  5. Outbox dispatcher forwards actions to AgentField reasoners.
  6. AgentField completion callbacks persist workflow and audit evidence records.
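
Step 2 of the ingest path can be sketched as follows. The route and the three headers come from the list above; the JSON body envelope (`{"events": [...]}`), the `Content-Type` header, and the function name `build_ingest_request` are assumptions made for illustration, since the payload contract is not described here.

```python
import json
import urllib.request
import uuid


def build_ingest_request(base_url, token, workspace_id, events, idempotency_key=None):
    """Build (but do not send) a POST to the ingest route with the documented headers."""
    # Reuse the caller's key on retries so the gateway can deduplicate the batch.
    key = idempotency_key or str(uuid.uuid4())
    # The body shape is an assumption; the real schema contract may differ.
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/events/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "x-workspace-id": workspace_id,
            "Idempotency-Key": key,
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Note that `urllib` normalizes header-name casing internally, which is harmless because HTTP header names are case-insensitive.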

Notes:

  • mTLS is supported at ingress/network layer when enabled by platform infra policy.
  • Gateway route auth contract is currently bearer JWT + workspace header.

Enrollment and visibility APIs

  • GET /api/v1/edge-agent/enrollment: effective enrollment mode/policy metadata and ingest URL.
  • POST /api/v1/edge-agent/enrollment/rotate: idempotently rotates the workspace enrollment token and returns the new token with its expiry metadata.
  • GET /api/v1/edge-agent/ingest-status: 24h ingest batch/event totals, schema rejection count, and outbox pending/failed counts.

Admin/Ops UI expectations

  • Admin Settings contains the enrollment rotation UX and a token display window for secure handoff.
  • Ops surfaces ingest and routing health as first-class context alongside queue and latency data.
  • Alert policy controls (mode, approval, retry bounds) are managed in structured settings forms, not raw JSON by default.

Operational closure flow:

  1. alert.triggered actions can be linked to tickets via POST /api/v1/ops/alerts/{id}/link-ticket.
  2. Tickets can be linked to runbooks/assets and escalated/resolved from Ops routes.
  3. Workflow retries/cancellations are exposed through workflow control interfaces.

Reliability controls

  • Offline-safe replay from local buffer.
  • Idempotent mutation semantics.
  • Correlation IDs across ingest and workflow pipelines.
  • Outbox retry/backoff with terminal failed status after max attempts.
  • Operator retry controls via /api/v1/agent-runtime/outbox-failures/{id}/retry.
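
The outbox retry/backoff behavior can be illustrated with a small state sketch. The attempt limit, the delay values, the exponential schedule, and the names `OutboxEvent`, `record_failure`, and `operator_retry` are all assumptions; the text above only states that retries back off and end in a terminal failed status after max attempts, and that operators can retry failed entries.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 5          # assumed limit
BASE_DELAY_SECONDS = 2    # assumed first retry delay
MAX_DELAY_SECONDS = 300   # assumed cap on any single delay


@dataclass
class OutboxEvent:
    id: str
    status: str = "pending"
    attempts: int = 0
    next_delay_seconds: float = 0.0


def record_failure(event: OutboxEvent) -> OutboxEvent:
    """Count a failed dispatch; schedule a retry or mark the event terminally failed."""
    event.attempts += 1
    if event.attempts >= MAX_ATTEMPTS:
        event.status = "failed"
        event.next_delay_seconds = 0.0
    else:
        # Exponential backoff: 2s, 4s, 8s, ... capped at MAX_DELAY_SECONDS.
        delay = BASE_DELAY_SECONDS * 2 ** (event.attempts - 1)
        event.next_delay_seconds = min(delay, MAX_DELAY_SECONDS)
    return event


def operator_retry(event: OutboxEvent) -> OutboxEvent:
    """Reset a terminally failed event so a dispatcher would pick it up again."""
    if event.status != "failed":
        raise ValueError("only failed events can be retried by an operator")
    event.status = "pending"
    event.attempts = 0
    event.next_delay_seconds = 0.0
    return event
```

In this sketch, `operator_retry` stands in for whatever the operator retry route does server-side; whether the real system resets the attempt counter is not documented here.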

How metrics are measured

Telemetry and runtime metrics in Admin/Ops are measured from persisted database state:

  • events24h / batches24h: aggregated from telemetry_ingest_batches.
  • pendingOutbox / failedOutbox: aggregated from outbox_events by status.
  • Worker runtime health: heartbeat recency and status in worker_runtime_status.
  • Ops failure buckets: failed rows from outbox_events, background_jobs, and workflow_executions.
  • Console/Portal KPIs (tickets, alerts, invoices): computed from PSA resource records.
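
As an illustration of measuring from persisted state, the pendingOutbox / failedOutbox figures could be derived with a grouped count over outbox_events. This sketch uses an in-memory SQLite table purely for demonstration; the column names, the status values `pending` and `failed`, and the function `outbox_counts` are assumptions, and the production database engine is not specified in this document.

```python
import sqlite3


def outbox_counts(conn: sqlite3.Connection) -> dict:
    """Count outbox rows by status and map them to the two metric names."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM outbox_events "
        "WHERE status IN ('pending', 'failed') GROUP BY status"
    ).fetchall()
    counts = dict(rows)
    # Statuses with no rows are absent from the result set, so default to 0.
    return {
        "pendingOutbox": counts.get("pending", 0),
        "failedOutbox": counts.get("failed", 0),
    }


# Demo data with an assumed, simplified table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox_events (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO outbox_events (status) VALUES (?)",
    [("pending",), ("pending",), ("failed",), ("dispatched",)],
)
print(outbox_counts(conn))  # {'pendingOutbox': 2, 'failedOutbox': 1}
```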

How a system hooks into Anchor

  1. In Admin Settings, rotate edge enrollment token for the target workspace.
  2. Install/configure endpoint agent with:
    • API base URL
    • workspace ID
    • rotated enrollment token
  3. Agent pulls GET /api/v1/edge-agent/policy and starts collection loops.
  4. Agent posts batches to POST /api/v1/events/ingest with idempotency keys.
  5. Verify ingest and routing in UI:
    • Admin Settings -> Telemetry Enrollment stats
    • Ops -> Failed Outbox/Jobs and runtime health
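
On the agent side, posting batches with idempotency keys pairs naturally with replay from a local buffer: each batch keeps the same key across retries, so a resend after a network failure cannot be ingested twice. The sketch below is a hypothetical in-memory version (`LocalBuffer` and the `send` callback are illustrative names); a real agent would presumably persist the buffer to disk so that it survives restarts.

```python
import uuid
from collections import deque


class LocalBuffer:
    """FIFO of (idempotency_key, events) batches awaiting delivery."""

    def __init__(self):
        self._batches = deque()

    def __len__(self):
        return len(self._batches)

    def enqueue(self, events):
        """Store a batch and assign its idempotency key once, at enqueue time."""
        key = str(uuid.uuid4())
        self._batches.append((key, list(events)))
        return key

    def flush(self, send):
        """Send batches in order via send(key, events) -> bool.

        Stops at the first failure so ordering is preserved, and leaves the
        unsent batches (with their original keys) for the next flush.
        Returns the number of batches delivered.
        """
        sent = 0
        while self._batches:
            key, events = self._batches[0]
            try:
                ok = send(key, events)
            except OSError:  # covers connection errors and timeouts
                ok = False
            if not ok:
                break
            self._batches.popleft()
            sent += 1
        return sent
```

The `send` callback would wrap the actual POST to the ingest route and return True only on a successful response.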

Control-center telemetry integration

  • Gateway telemetry client uses @egintegrations/telemetry.
  • Required env vars:
    • ENGINE_ID
    • ENGINE_SKU
    • EGI_CONTROL_CENTER_URL
    • EGI_TELEMETRY_TOKEN (optional if control center allows anonymous writes)
    • EGI_TELEMETRY_ENABLED
  • Runtime health compatibility endpoint:
    • GET /.well-known/engine-status

Production notes:

  • Staging/production enforce explicit control-center URLs and reject status-mock endpoints.
  • Staging/production require EGI_CONTROL_CENTER_URL to use https://.
  • Run Argo prune sync after disabling statusMock so stale mock resources are removed.
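
A startup check along these lines could enforce the environment variables and the https:// rule described above. This is a sketch under assumptions: the function name `validate_telemetry_env`, the `environment` argument and its values, and the error wording are hypothetical, and the actual checks performed by the gateway or the @egintegrations/telemetry client are not shown in this document.

```python
from urllib.parse import urlparse

# EGI_TELEMETRY_TOKEN is omitted because it is documented as optional.
REQUIRED_VARS = (
    "ENGINE_ID",
    "ENGINE_SKU",
    "EGI_CONTROL_CENTER_URL",
    "EGI_TELEMETRY_ENABLED",
)
STRICT_ENVIRONMENTS = {"staging", "production"}  # assumed environment names


def validate_telemetry_env(env: dict, environment: str) -> list:
    """Return a list of human-readable problems; an empty list means the config passes."""
    errors = [f"{name} is required" for name in REQUIRED_VARS if not env.get(name)]
    url = env.get("EGI_CONTROL_CENTER_URL", "")
    if url and environment in STRICT_ENVIRONMENTS and urlparse(url).scheme != "https":
        errors.append("EGI_CONTROL_CENTER_URL must use https:// in staging/production")
    return errors
```

Calling it with `dict(os.environ)` at process start and refusing to boot on a non-empty result is one way to apply it.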

Local/dev note:

  • apps/status-mock remains available for local telemetry/status contract tests.