Observability

Scrape metrics, probe health, and tune logging for a running OpenCCU-Loom daemon.

Who this page is for

Administrators wiring the daemon into Prometheus, an uptime monitor, or a log pipeline. For triage of specific failures, see the troubleshooting guide (docs/admin/troubleshooting.md).

Metrics

The daemon exposes a Prometheus endpoint in the standard text exposition format.

| Property | Value |
| --- | --- |
| Endpoint | `GET /api/v1/metrics` |
| Content type | `text/plain; version=0.0.4; charset=utf-8` |
| Auth | Required. The endpoint sits inside the authenticated route group. |

The metrics endpoint is authenticated

Unlike many Prometheus exporters, `/api/v1/metrics` is not anonymous: it is mounted inside the daemon's auth-required group, so your scrape job must present credentials.

Prometheus scrape config

Use an API token (see the authentication guide) so the scraper carries a bearer credential, which also bypasses CSRF:

scrape_configs:
  - job_name: openccu-loom
    metrics_path: /api/v1/metrics
    scheme: https            # only if a TLS proxy fronts the daemon; otherwise http
    authorization:
      type: Bearer
      credentials: "<api-token>"
    static_configs:
      # north.rest.listen defaults to :8080; behind a TLS proxy, use the proxy's host:port
      - targets: ["loom.example:8080"]

Basic auth also works, through Prometheus's `basic_auth` block, if you prefer a username and password over a token.
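Before wiring the token into Prometheus, it can help to confirm that it is accepted. The following is a minimal sketch using only the Python standard library; the helper names (`build_metrics_request`, `looks_like_exposition`, `count_samples`) are hypothetical and not part of OpenCCU-Loom, and the base URL is whatever your deployment uses.

```python
import urllib.request

METRICS_PATH = "/api/v1/metrics"  # endpoint path documented above


def build_metrics_request(base_url: str, api_token: str) -> urllib.request.Request:
    """Build an authenticated GET for the metrics endpoint (no network I/O here)."""
    url = base_url.rstrip("/") + METRICS_PATH
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_token}"},
        method="GET",
    )


def looks_like_exposition(content_type: str) -> bool:
    """True if the Content-Type header advertises a text/plain payload."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


def count_samples(body: str) -> int:
    """Count sample lines, skipping blank lines and '#' comment lines (HELP/TYPE)."""
    return sum(
        1
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
```

Passing the request to `urllib.request.urlopen(req, timeout=5)` performs the scrape; a rejected credential surfaces as `urllib.error.HTTPError`.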

Health

There are two health surfaces: a JSON endpoint on the REST API and a server-rendered page on the UI port.

REST health endpoint

| Property | Value |
| --- | --- |
| Endpoint | `GET /api/v1/health` |
| Auth | None. Reachable without credentials so that a load balancer can probe it. |
| Body | `{"status": "...", "components": [{ "name", "status", "note", "recorded_at" }]}` |
| HTTP code | `200` normally, `503` when the service is genuinely unavailable. |

The top-level `status` collapses component health into service availability; it is not simply the worst component status. A single south-bound interface or the MQTT bridge going down only degrades the daemon, because the REST/UI surface keeps serving. The endpoint returns `503` only when a fatal dependency (the SQLite persistence layer or the central coordinator) is unhealthy, or when every interface is down at once.
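The collapse rule can be modelled in a few lines. This is an illustrative sketch, not the daemon's code: the component names in `FATAL`, the idea of passing interface names in explicitly, the reading of "down" as `unhealthy`, and the top-level status strings returned are all assumptions made for the example.

```python
from typing import Iterable, Mapping, Set, Tuple

# Hypothetical names for the fatal dependencies; the real names come from
# the "components" array of the daemon's response.
FATAL = {"persistence", "coordinator"}


def collapse(components: Iterable[Mapping[str, str]],
             interface_names: Set[str]) -> Tuple[str, int]:
    """Map per-component statuses to (top-level status, HTTP code)."""
    status_by_name = {c["name"]: c["status"] for c in components}

    # A fatal dependency being unhealthy makes the whole service unavailable.
    fatal_down = any(status_by_name.get(name) == "unhealthy" for name in FATAL)

    # So does every known interface being unhealthy at the same time.
    interfaces = [status_by_name[n] for n in interface_names if n in status_by_name]
    all_interfaces_down = bool(interfaces) and all(s == "unhealthy" for s in interfaces)

    if fatal_down or all_interfaces_down:
        return "unhealthy", 503
    if any(s != "healthy" for s in status_by_name.values()):
        return "degraded", 200  # partial failure: still serving
    return "healthy", 200
```

With one of two interfaces unhealthy, the sketch yields `("degraded", 200)`; with both unhealthy, or with the persistence component unhealthy, it yields `("unhealthy", 503)`.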

Component statuses

Each component reports one of four statuses:

| Status | Meaning |
| --- | --- |
| `healthy` | Last sample was good. |
| `degraded` | A single recent failure after a healthy run (flap-damped). |
| `unhealthy` | Repeated failure / hard fault. |
| `unknown` | Never reported yet, or its last sample is stale (older than ~90 s), so a component that has gone silent does not stay green. |
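The staleness rule for `unknown` can be read as follows. This is a sketch of the documented behaviour rather than the daemon's implementation: the exact threshold is approximate in the source (about 90 s), and the function name `effective_status` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(seconds=90)  # approximate figure from the status table


def effective_status(reported: Optional[str],
                     recorded_at: Optional[datetime],
                     now: datetime) -> str:
    """Return the status to display, downgrading silent components to 'unknown'."""
    if reported is None or recorded_at is None:
        return "unknown"  # the component has never reported
    if now - recorded_at > STALE_AFTER:
        return "unknown"  # the last sample is too old to trust
    return reported
```

The function takes `datetime` values rather than parsing `recorded_at`, because the timestamp format of that field is not specified here.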

UI health page

The bootstrap UI on the UI port (default `:8081`) serves a server-rendered `/health` page, and `/` redirects to it. It is the fallback for diagnosing the daemon when the SPA's JavaScript bundle will not load. `/about` on the same port shows the version and license.

Using health in a probe

Point your load-balancer or uptime monitor at the REST endpoint and treat the HTTP code as the verdict:

# Healthy/degraded → 200; service down → 503
curl -fsS https://loom.example/api/v1/health

Probe code vs. body

For a binary up/down check, rely on the HTTP status (503 = drain this instance). For a dashboard, parse the components array and render per-component status — a degraded body still returns 200, which is intentional.
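The split between status code and body might look like this on the consuming side. It is a minimal sketch: `should_drain` and `non_healthy_components` are hypothetical helper names, and treating every code other than `200` as "down" is a conservative choice made for the example, not something the daemon prescribes.

```python
import json
from typing import List, Tuple


def should_drain(http_status: int) -> bool:
    """Binary verdict for a load balancer: only 200 counts as serving."""
    return http_status != 200


def non_healthy_components(body: str) -> List[Tuple[str, str]]:
    """Dashboard view: (name, status) for every component that is not healthy."""
    doc = json.loads(body)
    return [
        (component["name"], component["status"])
        for component in doc.get("components", [])
        if component.get("status") != "healthy"
    ]
```

A degraded response therefore keeps the instance in rotation (`should_drain(200)` is false) while the dashboard still lists the affected components.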

Logging

Logging is structured (Go's `log/slog`) and is configured under the top-level `logging` block.

logging:
  level: info        # debug | info | warn | error (default: info)
  format: json       # json | text | text-color (default: json)
  overrides:         # optional per-subsystem static overrides
    openccu-loom.client.transport.xmlrpc: debug

- `level` sets the global default; invalid values are rejected at config load.
- `format` selects `json` (default), plain `text`, or colourised `text-color`.
- `overrides` maps a dot-separated subsystem path to a level. Overrides resolve hierarchically: an override on `openccu-loom.client` applies to every descendant unless that descendant has its own override.
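Hierarchical resolution amounts to a most-specific-ancestor lookup. The sketch below models that rule in Python for illustration; `resolve_level` is a hypothetical name, and the daemon's actual lookup may be implemented differently.

```python
from typing import Dict


def resolve_level(path: str, overrides: Dict[str, str], default: str = "info") -> str:
    """Return the level for a subsystem path, preferring the most specific override."""
    parts = path.split(".")
    # Walk from the full path up towards the root, one dot-separated segment
    # at a time, so that a descendant's own override beats an ancestor's.
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in overrides:
            return overrides[candidate]
    return default  # no ancestor has an override: fall back to the global level
```

Matching whole segments rather than raw string prefixes means an override on `openccu-loom.client` does not leak onto an unrelated sibling whose name merely starts with the same characters.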

Dynamic per-subsystem log levels

You can change levels at runtime without editing YAML or restarting, through admin-gated diagnostics endpoints:

| Method & path | Purpose |
| --- | --- |
| `GET /api/v1/diagnostics/log-levels` | List the default level and every active override. |
| `PUT /api/v1/diagnostics/log-levels/{path}` | Install/replace an override for a subsystem path. |
| `DELETE /api/v1/diagnostics/log-levels/{path}` | Remove an override (idempotent). |

The PUT body carries the level and an optional TTL:

curl -u admin:… -X PUT \
  https://loom.example/api/v1/diagnostics/log-levels/openccu-loom.client \
  -H 'Content-Type: application/json' \
  -d '{"level":"debug","ttl_seconds":600}'

- `ttl_seconds: 0` (or omitted) sets no expiry: the override stays in place until it is reset or the daemon restarts.
- A positive TTL auto-expires. The endpoint caps user-supplied TTLs at 24 hours so that a forgotten debug override does not run forever.

Use a TTL when debugging live

Set a bounded ttl_seconds when you raise verbosity on a busy subsystem in production — it reverts itself even if you forget.
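A small client-side helper can keep the PUT body within the documented bounds. This is a sketch under assumptions: `override_body` is a hypothetical name, and because the source does not say whether the endpoint clamps or rejects a TTL above 24 hours, the helper clamps locally so that it does not depend on either behaviour.

```python
import json

MAX_TTL_SECONDS = 24 * 60 * 60                 # the documented 24-hour cap
LEVELS = {"debug", "info", "warn", "error"}    # levels listed in the logging block


def override_body(level: str, ttl_seconds: int = 0) -> str:
    """Build the JSON body for PUT /api/v1/diagnostics/log-levels/{path}."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be zero or positive")
    # 0 means no expiry; anything above the cap is reduced to the cap.
    ttl = min(ttl_seconds, MAX_TTL_SECONDS)
    return json.dumps({"level": level, "ttl_seconds": ttl})
```

For example, `override_body("debug", 600)` produces the same payload as the curl example above.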

Redaction of secrets in logs

A redaction handler masks sensitive attribute values with `***REDACTED***` before they reach any log sink. Matching is case-insensitive and substring-based, so nested groups (`oidc.client_secret`) and header-style keys (`X-Api-Key`) are caught. The masked key set includes `password`, `passwd`, `secret`, `token`, `api_key` / `api-key` / `apikey`, `authorization`, `auth_header`, `cookie`, `set_cookie`, `session_id` / `sessionid`, `client_secret`, `refresh_token`, `access_token`, `id_token`, `bearer`, and `private_key`.

Redaction is shallow on opaque values

Top-level attributes and `slog.Group` members are inspected; arbitrary map/struct values that arrive as a single opaque value are not introspected. Expose individual fields via `slog.Group` or `slog.Attr` if you need them redacted.
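The matching rule and its shallow-inspection limit can be illustrated with a small model. This is a Python sketch of the behaviour described above, not the daemon's Go handler: nested dictionaries stand in for `slog.Group`, the names `is_sensitive` and `redact` are hypothetical, and masking the whole value when a group's own key matches is an assumption.

```python
from typing import Any, Dict

MASK = "***REDACTED***"

# Key fragments listed in the text above.
SENSITIVE_FRAGMENTS = (
    "password", "passwd", "secret", "token", "api_key", "api-key", "apikey",
    "authorization", "auth_header", "cookie", "set_cookie", "session_id",
    "sessionid", "client_secret", "refresh_token", "access_token", "id_token",
    "bearer", "private_key",
)


def is_sensitive(key: str) -> bool:
    """Case-insensitive substring match against the sensitive fragments."""
    lowered = key.lower()
    return any(fragment in lowered for fragment in SENSITIVE_FRAGMENTS)


def redact(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Mask sensitive values; recurse into dicts (groups), pass opaque values through."""
    out: Dict[str, Any] = {}
    for key, value in attrs.items():
        if is_sensitive(key):
            out[key] = MASK
        elif isinstance(value, dict):   # stands in for a slog.Group
            out[key] = redact(value)
        else:
            out[key] = value            # opaque value: not introspected
    return out
```

In this model, an object that carries a password field but is logged under a neutral key passes through unmasked, which mirrors why individual fields need to be exposed as separate attributes.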

See also:

- Security guide: auth, secrets at rest, TLS posture.
- Authentication & users: minting the API token your scraper needs.
- Troubleshooting: failure triage lives in the admin troubleshooting guide (docs/admin/troubleshooting.md).