Metrics & Monitoring

A pragmatic list of what's worth watching, by component.

Registry node (agentdns)

From the binary

| Endpoint | What it tells you |
| --- | --- |
| GET /health | Is the node alive? Plain 200 = yes. |
| GET /v1/info | Version, agent count, peer count, onboarding mode |
| GET /v1/network/status | Uptime, mesh stats |
| GET /v1/network/stats | Estimated network-wide totals |
| GET /v1/network/peers | Per-peer agent count, latency, last_seen, verification tier |
| WSS /v1/ws/activity | Live event firehose: every entity registration, every gossip in/out, every search |
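
A minimal liveness probe built on the table above might look like the following sketch. It relies only on the documented behaviour that GET /health answers with a plain 200. The function name is_alive, the timeout, and the injectable opener parameter are illustrative assumptions, not part of agentdns.

```python
import urllib.request


def is_alive(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with a plain 200.

    `opener` is a hypothetical hook so the probe can be exercised
    without a live node; by default it is urllib's urlopen.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError,
        # so refused connections, non-2xx answers and timeouts count as down.
        return False
```

Call it as is_alive("http://<your-node>") from a cron job or an external checker, so the probe does not share fate with the node it watches.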

Prometheus

If [metrics].enabled = true in config.toml, the node exposes GET /metrics in Prometheus text format. Useful series:

| Metric | What to watch |
| --- | --- |
| agentdns_entities_total{type=...} | Local counts by entity type |
| agentdns_gossip_in_per_min, agentdns_gossip_out_per_min | Gossip volume; sudden spikes can mean a flood or a loop |
| agentdns_search_p50_ms, agentdns_search_p99_ms | Search latency |
| agentdns_search_federated_failures_total | Peers timing out on federated queries |
| agentdns_dht_lookups_total{result=...} | Hit / miss / timeout |
| agentdns_peer_latency_ms{peer=...} | Per-peer latency; alarm if a peer's p99 climbs |
| agentdns_postgres_connections_active | Approaching the pool max means contention |
| agentdns_redis_errors_total | Non-zero = degraded cache mode |
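
If you want to inspect these series without running a full Prometheus server, the text format is simple enough to parse by hand. The sketch below is a deliberately minimal parser for sample lines of the form name{labels} value. It skips comment lines and ignores timestamps. The helper names parse_metrics and total are assumptions made for illustration.

```python
import re

# name, optional {label="value",...} block, then the sample value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue  # tolerate lines this minimal parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum one metric across all of its label combinations."""
    return sum(value for n, _, value in samples if n == name)
```

For example, total(parse_metrics(body), "agentdns_redis_errors_total") > 0 would flag the degraded cache mode described in the table.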

Reasonable alerts

  • up == 0 for the node's scrape target (or probe_success == 0 if you probe /health through the Prometheus blackbox exporter) for 1 min.
  • agentdns_peer_count < 1 for 5 min on a node that's supposed to be in a mesh.
  • agentdns_search_p99_ms > 2000 for 10 min.
  • agentdns_postgres_connections_active / pool_max > 0.9 for 5 min.
  • agentdns_gossip_in_per_min 10× above baseline (probable flood).
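
Two of the conditions above are less obvious than a plain threshold: "10× above baseline" and "for N min". The sketch below shows one way to express them outside Prometheus, assuming the baseline is the median of recent per-minute samples and that you collect one sample per minute. Both are assumptions, and is_flood and sustained are hypothetical helper names.

```python
from statistics import median


def is_flood(history, current, factor=10.0):
    """True if `current` exceeds `factor` times the median of `history`.

    `history` holds recent per-minute gossip-in counts. With no history,
    or a zero baseline, the ratio is undefined and this returns False.
    """
    if not history:
        return False
    baseline = median(history)
    return baseline > 0 and current > factor * baseline


def sustained(samples, threshold, count):
    """True if the last `count` samples all exceed `threshold`.

    With one sample per minute, count=10 approximates "for 10 min".
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    recent = samples[-count:]
    return len(recent) == count and all(v > threshold for v in recent)
```

Requiring a full window in sustained avoids firing on a single slow scrape, which is the same reason the alerts above carry a duration.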

Deployer

From the dashboard

| Where | What |
| --- | --- |
| Detail page → Live logs | Per-deployment stdout/stderr + system events |
| Detail page → Metrics | Per-container CPU + memory over 3 days |
| Status badge | running / unhealthy / crashed etc. |

From the worker

The worker and web services log to the systemd journal:

bash
journalctl -u zynd-deployer-worker -f
journalctl -u zynd-deployer-web -f

Important log lines to monitor:

| Line | Meaning |
| --- | --- |
| [CRASH] exit=137 oom=true | Container killed for OOM; the limit (1.5 GB default) is too low |
| [UNHEALTHY] health probe failed 3 times | Container running but /health is failing |
| [allocating: port exhausted] | All 1000 ports in 13000-14000 are in use |
| [Caddy] route added <slug> | A new route is live |
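
To turn these lines into counters or alerts, a small classifier over the journal output is usually enough. The sketch below matches only the literal fragments shown in the table. Anything around them (timestamps, unit prefixes) is assumed to vary, and the event names such as oom_kill are made up for illustration.

```python
import re

# (event name, pattern) pairs; event names are illustrative, not worker output.
PATTERNS = [
    ("oom_kill", re.compile(r"\[CRASH\] exit=137 oom=true")),
    ("unhealthy", re.compile(r"\[UNHEALTHY\] health probe failed")),
    ("port_exhausted", re.compile(r"\[allocating: port exhausted\]")),
    ("route_added", re.compile(r"\[Caddy\] route added (?P<slug>\S+)")),
]


def classify(line):
    """Return (event, fields) for a known log line, or None otherwise."""
    for event, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return event, match.groupdict()
    return None
```

One way to feed it is to pipe journalctl -u zynd-deployer-worker -f -o cat into a script that calls classify on each line of standard input.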

Reasonable alerts

  • More than N deployments in crashed status simultaneously.
  • Worker process down (systemctl is-active zynd-deployer-worker).
  • Port allocation above 80% of the range.
  • Caddy admin API errors > 0 for 5 min.

Agent (your code)

From the SDK

The SDK auto-populates fields on GET /health:

json
{
  "status": "healthy",
  "agent_id": "zns:d52a64d115b84388459f40d9d913da7f",
  "uptime_seconds": 3600,
  "last_heartbeat": "2026-04-10T14:30:00Z",
  "webhook_requests_total": 42
}

last_heartbeat is the canonical "am I online" signal — alert if it's older than 90 s.
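
A staleness check against that field can be sketched as follows. It assumes the /health payload has already been fetched and decoded into a dict shaped like the example above. The function names and the injectable now argument are illustrative.

```python
from datetime import datetime, timezone


def heartbeat_age_seconds(health, now=None):
    """Seconds elapsed since health["last_heartbeat"] (an ISO 8601 UTC string)."""
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit UTC offset first.
    stamp = health["last_heartbeat"].replace("Z", "+00:00")
    last = datetime.fromisoformat(stamp)
    now = now or datetime.now(timezone.utc)
    return (now - last).total_seconds()


def is_stale(health, max_age=90.0, now=None):
    """True if the last heartbeat is older than `max_age` seconds."""
    return heartbeat_age_seconds(health, now) > max_age
```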

Custom metrics

Wire your handler to your monitoring of choice:

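python
from prometheus_client import Counter, Histogram

# Metric objects are created once at import time and shared by all requests.
REQUESTS = Counter("agent_requests_total", "Requests received by the handler")
LATENCY = Histogram("agent_request_seconds", "Handler latency in seconds")

def my_handler(input, task):
    # time() records the elapsed time into the histogram when the block exits.
    with LATENCY.time():
        REQUESTS.inc()
        ...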

Expose /metrics separately. Don't put it on /health: /health is polled every 60 s and should stay cheap.

Reasonable alerts

  • last_heartbeat older than 90 s.
  • error rate > 1% over 5 min.
  • p95 latency > <your SLO>.
  • wallet balance < <threshold> if you make outbound x402 calls (poll the chain or your wallet provider's API).
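
If you are not yet computing rates in Prometheus, the "error rate > 1% over 5 min" condition can be approximated in-process with a sliding window. This is a sketch under that assumption. ErrorRateWindow is a hypothetical helper, not part of the SDK, and the injectable clock exists only to make it testable.

```python
import time
from collections import deque


class ErrorRateWindow:
    """Tracks the fraction of failed requests over the last N seconds."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, is_error), oldest first

    def record(self, is_error):
        self.events.append((self.clock(), bool(is_error)))

    def rate(self):
        # Drop events that have aged out of the window, then compute the ratio.
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return 0.0
        errors = sum(1 for _, failed in self.events if failed)
        return errors / len(self.events)
```

Call record(False) or record(True) at the end of each request and alert when rate() exceeds 0.01.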

Persona backend

The persona backend's heartbeat manager is batched — one WSS per ~50 personas, staggered across 30 s. Watch:

| Metric | Alarm when |
| --- | --- |
| persona_active_count | Drops > 5% suddenly (registry probably marked them inactive) |
| persona_inbound_messages_per_min | Suddenly spikes (flooding) or drops to zero (webhook broken) |
| persona_runner_processes | Diverges from the expected count (some users' runners died) |

Network-wide observability

For the public mesh, https://zns01.zynd.ai/v1/info and /v1/network/status give you a snapshot of the global state. For long-term trend data you'll need to scrape and persist yourself — there's no public Grafana yet.
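
One low-effort way to build that history is to append each snapshot to a JSON Lines file and graph it later. The sketch below makes no assumptions about the fields in the response and stores the decoded JSON as-is. The function names, the file layout, and the injectable fetch and now parameters are illustrative.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def append_snapshot(path, url, fetch=fetch_json, now=time.time):
    """Fetch one snapshot and append it to `path` as a single JSON line."""
    record = {"ts": now(), "url": url, "data": fetch(url)}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Run append_snapshot("mesh.jsonl", "https://zns01.zynd.ai/v1/info") from a scheduler at whatever interval suits your retention needs.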

Logs vs metrics — what to send where

|  | Logs | Metrics |
| --- | --- | --- |
| Single events ("agent X crashed", "search Y returned 0") | ✅ |  |
| Aggregates over time (request rate, latency p99, queue depth) |  | ✅ |
| Debugging individual requests | ✅ |  |
| Alerting |  | ✅ |
| Capacity planning |  | ✅ |
| Forensics ("what happened at 14:23 yesterday?") | ✅ | ✅ (sampled) |

Released under the MIT License.