Worker Subsystems
The worker is a single Node process of roughly ten files, each with a single responsibility. The files share the Prisma client and a config singleton; everything else is wired explicitly in worker/main.ts.
main.ts — entry & supervision
Three concerns are started here:
- Drain queue loop — every 1 s, claim the oldest queued row and call lifecycle.drive(deploymentId).
- Reconcile stops loop — every 1 s, find any row whose status flipped to stopped from outside the worker (typically the Stop button) and tear down its container, route, and port.
- One-shot supervisors — started once at boot:
  - watchCrashes() — docker events consumer.
  - startMetricsLoop() — CPU / memory sampler.
  - startHealthLoop() — /health poller for every running deployment.
  - startRetentionLoop() — log + metric GC.
  - startWsLogServer() — WebSocket log fan-out for the dashboard.
  - startTailer() for each running deployment, so logs resume after a worker restart.
If any of these throws, the error is logged and the affected deployment is marked failed or crashed — supervisor failures never bring down the worker process itself.
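A minimal sketch of the drain loop under these assumptions; the Deployment model's status and createdAt columns and the error-handling details are illustrative, not the real implementation:

```ts
// worker/main.ts — drain-loop sketch. Assumes the Deployment model has
// status and createdAt columns; everything else follows the text above.
import { PrismaClient } from "@prisma/client";
import { drive } from "./lifecycle";

const prisma = new PrismaClient();

async function drainQueue(): Promise<void> {
  // Oldest queued row wins. A single worker process means no claim race,
  // and drive() flips status away from "queued" on its first transition,
  // closing the re-claim window before the next tick.
  const next = await prisma.deployment.findFirst({
    where: { status: "queued" },
    orderBy: { createdAt: "asc" },
  });
  if (!next) return;

  // Deliberately not awaited: multiple deployments progress in parallel,
  // and drive() marks its own row failed on error.
  drive(next.id).catch((err) => console.error(`drive(${next.id})`, err));
}

setInterval(() => {
  drainQueue().catch((err) => console.error("drainQueue", err));
}, 1_000);
```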
lifecycle.ts — state machine driver
drive(deploymentId) walks one deployment through every state from unpacking to running. Each transition:
- Update Deployment.status (Postgres is the source of truth).
- Append a system-log line via appendSystemLog.
- Run the stage-specific work.
- On exception, set status="failed" and errorMessage, then return.
The function is async but linear — there's no concurrency inside a single deployment's pipeline. Multiple deployments run in parallel because drainQueue calls drive() without awaiting.
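A sketch of drive()'s linear shape. The intermediate stage names (building, starting) and the stage helpers are illustrative stand-ins; only unpacking, health_checking, running, failed, and errorMessage come from the text above:

```ts
// worker/lifecycle.ts — shape of drive(). Stage helpers are stubs for
// the real stage-specific work that lives in docker.ts, health.ts, etc.
import { PrismaClient } from "@prisma/client";
import { appendSystemLog } from "./logs";

const prisma = new PrismaClient();

type Status = "unpacking" | "building" | "starting" | "health_checking" | "running";

async function transition(id: string, status: Status, work: () => Promise<void>) {
  // 1. Postgres is the source of truth for the current state.
  await prisma.deployment.update({ where: { id }, data: { status } });
  // 2. Every transition leaves a system-log line.
  await appendSystemLog(id, `-> ${status}`);
  // 3. Then the stage-specific work runs.
  await work();
}

// Stage stubs — real implementations live elsewhere in the worker.
async function unpack(id: string): Promise<void> {/* extract bundle */}
async function build(id: string): Promise<void> {/* install deps */}
async function startContainer(id: string): Promise<void> {/* docker run */}
async function waitHealthy(id: string): Promise<void> {/* /health probe */}

export async function drive(deploymentId: string): Promise<void> {
  try {
    await transition(deploymentId, "unpacking", () => unpack(deploymentId));
    await transition(deploymentId, "building", () => build(deploymentId));
    await transition(deploymentId, "starting", () => startContainer(deploymentId));
    await transition(deploymentId, "health_checking", () => waitHealthy(deploymentId));
    await transition(deploymentId, "running", async () => {});
  } catch (err) {
    // Any stage failure parks the row in `failed` with the message.
    await prisma.deployment.update({
      where: { id: deploymentId },
      data: { status: "failed", errorMessage: String(err) },
    });
  }
}
```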
docker.ts — Docker runtime glue
Wraps dockerode for everything container-shaped:
| Function | Purpose |
|---|---|
| runContainer(deployment, runtimeOpts) | docker run with mem/CPU limits, label deploymentId=<cuid>, port binding 127.0.0.1:<port>:5000, mount workdir read-only. |
| stopAndRemove(containerId) | SIGTERM with 10 s grace, then SIGKILL, then rm. |
| containerStats(containerId) | One-shot stats read for the metrics sampler. |
| containerLogs(containerId) | Streaming log read for the tailer. |
The base images (zynd-deployer/agent-base:latest and zynd-deployer/service-base:latest) are built by infra/install.sh from the runtime templates in worker/runtimes/.
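A sketch of what runContainer might look like against dockerode; the 512 MB / 1-CPU limits and the /app mount path are assumptions, everything else mirrors the table:

```ts
// worker/docker.ts — runContainer sketch against dockerode. The memory
// and CPU limits and the /app read-only mount are illustrative defaults.
import Docker from "dockerode";

const docker = new Docker(); // talks to /var/run/docker.sock by default

export async function runContainer(
  deploymentId: string,
  opts: { image: string; port: number; workdir: string; env: string[] }
): Promise<string> {
  const container = await docker.createContainer({
    Image: opts.image,
    Env: opts.env,
    // The label is what crash.ts uses to map a die event back to a row.
    Labels: { deploymentId },
    ExposedPorts: { "5000/tcp": {} },
    HostConfig: {
      Memory: 512 * 1024 * 1024, // hard memory cap
      NanoCpus: 1_000_000_000,   // 1 logical CPU
      // Bind only on loopback; Caddy is the public entry point.
      PortBindings: {
        "5000/tcp": [{ HostIp: "127.0.0.1", HostPort: String(opts.port) }],
      },
      Binds: [`${opts.workdir}:/app:ro`], // workdir mounted read-only
    },
  });
  await container.start();
  return container.id;
}
```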
caddy.ts — reverse proxy client
Talks to Caddy's admin API:
- ensureServer() — at startup, verifies the wildcard server is configured and creates it if not.
- addRoute(slug, port) — adds a reverse_proxy route matching <slug>.deployer.<wildcard>.
- removeRoute(slug) — deletes it.
All calls hit ${CADDY_ADMIN_URL}/config/.... Routes survive Caddy restarts because Caddy persists its admin-API config.
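A sketch of addRoute/removeRoute against the admin API, assuming the wildcard server is named "deployer" (the name and example.com domain are illustrative) and using Caddy's @id tags so deletion doesn't need to search the route array:

```ts
// worker/caddy.ts — route management sketch. Server name "deployer" and
// the example.com wildcard domain are assumptions.
const ADMIN = process.env.CADDY_ADMIN_URL ?? "http://127.0.0.1:2019";

export async function addRoute(slug: string, port: number): Promise<void> {
  const route = {
    // Tagging the route with @id lets removeRoute delete it directly.
    "@id": `deploy-${slug}`,
    match: [{ host: [`${slug}.deployer.example.com`] }],
    handle: [
      { handler: "reverse_proxy", upstreams: [{ dial: `127.0.0.1:${port}` }] },
    ],
  };
  // POST to .../routes appends to the server's route array.
  const res = await fetch(`${ADMIN}/config/apps/http/servers/deployer/routes`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(route),
  });
  if (!res.ok) throw new Error(`caddy addRoute failed: ${res.status}`);
}

export async function removeRoute(slug: string): Promise<void> {
  // Objects tagged with @id are addressable via the /id/ tree.
  await fetch(`${ADMIN}/id/deploy-${slug}`, { method: "DELETE" });
}
```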
ports.ts — port allocation
PortAllocation is the source of truth for port uniqueness — a unique key on (port) plus a unique key on (deploymentId) make double-allocation impossible.
| Function | Behavior |
|---|---|
| allocate(deploymentId) | Scans ports 13000–14000 in order, attempting INSERT ... ON CONFLICT DO NOTHING for each; returns the first port whose insert succeeds. Throws PORT_EXHAUSTED if the range is full. |
| release(deploymentId) | Deletes the PortAllocation row by deploymentId. Idempotent. |
Deployment.port is a non-unique mirror that may lag (e.g. stays set on a crashed deployment so the dashboard can still show it).
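A sketch of allocate() using Prisma's $executeRaw, which returns the number of affected rows; the quoted table and column names assume the PortAllocation model described above:

```ts
// worker/ports.ts — allocate() sketch. Table/column names assume the
// Prisma model PortAllocation(deploymentId, port).
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function allocate(deploymentId: string): Promise<number> {
  for (let port = 13000; port <= 14000; port++) {
    // ON CONFLICT DO NOTHING makes the insert race-safe: the unique
    // key on (port) guarantees at most one winner per port.
    const inserted = await prisma.$executeRaw`
      INSERT INTO "PortAllocation" ("deploymentId", "port")
      VALUES (${deploymentId}, ${port})
      ON CONFLICT DO NOTHING
    `;
    if (inserted === 1) return port; // $executeRaw returns affected rows
  }
  throw new Error("PORT_EXHAUSTED");
}
```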
health.ts — /health polling
Two roles:
- Initial probe during the health_checking state — up to 30 attempts at 1 s spacing. Required to advance.
- Steady-state probe during running — every 60 s. Three consecutive failures flip the row to unhealthy. The container is left untouched — the row recovers automatically when probes pass again.
The probe URL is always http://127.0.0.1:<port>/health; deployments that don't expose this endpoint are stuck in health_checking and eventually fail.
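A sketch of the initial probe loop; the 1 s per-request timeout is an assumption (AbortSignal.timeout requires Node 17.3+), the 30 × 1 s schedule comes from the text above:

```ts
// worker/health.ts — initial probe sketch for the health_checking state.
export async function waitHealthy(port: number): Promise<void> {
  for (let attempt = 1; attempt <= 30; attempt++) {
    try {
      const res = await fetch(`http://127.0.0.1:${port}/health`, {
        signal: AbortSignal.timeout(1_000), // don't hang on a dead socket
      });
      if (res.ok) return; // any 2xx advances the state machine
    } catch {
      // connection refused / timeout — container still booting
    }
    await new Promise((r) => setTimeout(r, 1_000)); // 1 s spacing
  }
  throw new Error("health check never passed");
}
```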
logs.ts + wsLogs.ts — log capture & fan-out
logs.ts:
- startTailer(deploymentId) — opens a streaming Docker log read, splits it into lines, and writes each into DeploymentLog with a monotonic lineNo.
- stopTailer(deploymentId) — closes the stream cleanly.
- appendSystemLog(deploymentId, text) — writes a synthetic line with stream="system" for state transitions and worker-emitted events.
wsLogs.ts runs a WebSocket server on a private port. The Next.js dashboard subscribes per deployment ID; new DeploymentLog rows are pushed in real time. Log SSE is also exposed via /api/deployments/[id]/logs for non-WebSocket clients.
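A sketch of startTailer using dockerode's multiplexed log stream, shown taking the container ID explicitly for brevity; the DeploymentLog fields lineNo and stream come from the text above, the text column name is an assumption:

```ts
// worker/logs.ts — startTailer sketch. Docker multiplexes stdout/stderr
// into one stream; demuxStream splits them back apart.
import Docker from "dockerode";
import { PassThrough } from "node:stream";
import { createInterface } from "node:readline";
import { PrismaClient } from "@prisma/client";

const docker = new Docker();
const prisma = new PrismaClient();

export async function startTailer(deploymentId: string, containerId: string) {
  const container = docker.getContainer(containerId);
  const raw = await container.logs({ follow: true, stdout: true, stderr: true });

  let lineNo = 0; // monotonic across both streams, keeps ordering stable
  const pipe = (stream: PassThrough, name: "stdout" | "stderr") => {
    createInterface({ input: stream }).on("line", (text) => {
      prisma.deploymentLog
        .create({ data: { deploymentId, stream: name, lineNo: lineNo++, text } })
        .catch(console.error);
    });
  };

  const out = new PassThrough();
  const err = new PassThrough();
  pipe(out, "stdout");
  pipe(err, "stderr");
  docker.modem.demuxStream(raw, out, err);
}
```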
metrics.ts — CPU & memory sampling
Every DEPLOYER_METRICS_INTERVAL_MS (default 30 s):
- List all running deployments.
- For each, read containerStats (cgroups: memory.usage_in_bytes, memory.limit_in_bytes, CPU usage delta).
- Insert a DeploymentMetric row.
Schema is (deploymentId, sampledAt, memUsedMb, memLimitMb, cpuPct). cpuPct is measured in logical CPUs consumed (0..N, where N is the host's CPU count), not as a percentage of the container's limit.
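The cpuPct figure comes from the standard delta calculation over a docker stats snapshot. A sketch, with field paths as in the Docker Engine stats payload:

```ts
// worker/metrics.ts — CPU sample sketch. stream: false returns a single
// snapshot with precpu_stats pre-populated, so one call yields a delta.
import Docker from "dockerode";

const docker = new Docker();

export async function sampleCpu(containerId: string): Promise<number> {
  const s = await docker.getContainer(containerId).stats({ stream: false });
  const cpuDelta =
    s.cpu_stats.cpu_usage.total_usage - s.precpu_stats.cpu_usage.total_usage;
  const sysDelta =
    s.cpu_stats.system_cpu_usage - s.precpu_stats.system_cpu_usage;
  const cpus = s.cpu_stats.online_cpus ?? 1;
  // Fraction of total host CPU, scaled to "number of logical CPUs used".
  return sysDelta > 0 ? (cpuDelta / sysDelta) * cpus : 0;
}
```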
crash.ts — Docker event watcher
Subscribes to docker events filtered to event=die. Each event carries the container's labels — deploymentId lets the watcher resolve back to a row and update status="crashed", lastExitCode, lastCrashAt.
Sub-second detection. Cheaper and faster than polling.
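A sketch of the watcher; the exitCode attribute follows the Docker event format for die events, and the one-JSON-object-per-chunk assumption holds at this event rate:

```ts
// worker/crash.ts — docker events watcher sketch. The deploymentId label
// and the status/lastExitCode/lastCrashAt fields come from the text above.
import Docker from "dockerode";
import { PrismaClient } from "@prisma/client";

const docker = new Docker();
const prisma = new PrismaClient();

export async function watchCrashes(): Promise<void> {
  // Server-side filter: only container die events reach us.
  const stream = await docker.getEvents({
    filters: { type: ["container"], event: ["die"] },
  });
  stream.on("data", (chunk: Buffer) => {
    const event = JSON.parse(chunk.toString());
    const deploymentId = event.Actor?.Attributes?.deploymentId;
    if (!deploymentId) return; // not one of our containers
    prisma.deployment
      .update({
        where: { id: deploymentId },
        data: {
          status: "crashed",
          lastExitCode: Number(event.Actor.Attributes.exitCode ?? -1),
          lastCrashAt: new Date(),
        },
      })
      .catch(console.error);
  });
}
```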
retention.ts — GC
Two passes, run daily: one over logs (covering both TTLs below) and one over metrics.
| Target | Default TTL | Env override |
|---|---|---|
| DeploymentLog (stream != "system") | 7 days | DEPLOYER_LOG_RETENTION_DAYS |
| DeploymentLog (stream = "system") | 30 days | DEPLOYER_SYSTEM_LOG_RETENTION_DAYS |
| DeploymentMetric | 3 days | DEPLOYER_METRIC_RETENTION_DAYS |
System logs (state transitions) are kept longer than stdout/stderr because they're small and useful for postmortems.
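A sketch of one GC pass with Prisma deleteMany; the timestamp column names (createdAt on DeploymentLog, sampledAt on DeploymentMetric) are assumptions:

```ts
// worker/retention.ts — GC pass sketch. TTL defaults and env overrides
// come from the table above; timestamp column names are assumed.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Env override (a string like "14") or the default, as milliseconds.
const days = (env: string | undefined, dflt: number) =>
  Number(env ?? dflt) * 24 * 60 * 60 * 1000;

export async function gcPass(): Promise<void> {
  const now = Date.now();
  await prisma.deploymentLog.deleteMany({
    where: {
      stream: { not: "system" },
      createdAt: { lt: new Date(now - days(process.env.DEPLOYER_LOG_RETENTION_DAYS, 7)) },
    },
  });
  await prisma.deploymentLog.deleteMany({
    where: {
      stream: "system",
      createdAt: { lt: new Date(now - days(process.env.DEPLOYER_SYSTEM_LOG_RETENTION_DAYS, 30)) },
    },
  });
  await prisma.deploymentMetric.deleteMany({
    where: {
      sampledAt: { lt: new Date(now - days(process.env.DEPLOYER_METRIC_RETENTION_DAYS, 3)) },
    },
  });
}
```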
runtimes/ — language adapters
Two implementations of the same Runtime interface, picked by lib/detect.ts:
| File | Detection | Build step | Launch |
|---|---|---|---|
| python.ts | requirements.txt or pyproject.toml present | pip install -r requirements.txt inside the container at first start | uvicorn <module>:app --host 0.0.0.0 --port 5000 (or whatever the agent's entry script is) |
| node.ts | package.json present | pnpm install --prod (or npm/yarn when the lockfile suggests it) | node <entry> from package.json main or the start script |
Both runtimes inject the same ZYND_* env block. The agent's own code is responsible for reading those and registering on the Zynd network — the deployer doesn't talk to the registry directly.
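A plausible shape for the shared Runtime interface; the method names are assumptions beyond what the table states:

```ts
// worker/runtimes/types.ts — sketch of the interface both adapters
// implement. Method names are illustrative.
export interface Runtime {
  /** e.g. requirements.txt / pyproject.toml vs. package.json */
  detects(workdir: string): boolean;
  /** command run inside the container at first start (pip/pnpm install) */
  buildCommand(workdir: string): string[];
  /** long-running launch command (uvicorn / node entry) */
  launchCommand(workdir: string): string[];
  /** the ZYND_* env block, identical across runtimes */
  env(deploymentId: string): Record<string, string>;
}
```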
What the worker does NOT do
- It doesn't proxy traffic — Caddy does that.
- It doesn't terminate TLS — Caddy + DNS-01 wildcard does that.
- It doesn't sign anything — the agent's keypair stays inside the container.
- It doesn't upload to the registry — the container does that with the keypair the user uploaded.
This separation keeps the worker's blast radius small. A worker compromise leaks running containers' keypairs (because the worker decrypted them at start time), but the master age key alone doesn't give an attacker live access to deployed agents — they'd need access to the running container too.
Next
- API Routes — what the web service exposes.
- Data Model — schemas the worker reads and writes.