The Runtime

When you deploy an app, it runs as its own process, not inside a shared server. A control-plane orchestrator (supervisord) spawns one detached tabbify-runner per app, and each runner joins the private mesh as a peer hosting exactly one workload on a deterministic address. Because runners are separate, detached processes, apps keep serving even if the supervisor crashes.

Supervisor and runners

supervisord hosts nothing itself. It is pure control plane: it spawns runners, persists their records to disk, monitors their liveness, and re-adopts surviving runners when it restarts. Each runner hosts one app and runs it as a Firecracker microVM (Tabbify ships exactly one runtime), serving on [app_ula]:8730.

# Orchestrator; --no-mesh binds loopback for local dev (no root/TUN)
supervisord --runner-bin ./tabbify-runner --no-mesh

# A runner is normally spawned for you, derived from the app UUID
tabbify-runner --uuid 0191e7c2-1111-7222-8333-444455556666 --no-mesh
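The "detached" part is what lets runners outlive supervisord. Here is a minimal sketch of one common POSIX way to get that behaviour, assuming the runner binary is launched with the --uuid and --no-mesh flags shown above. spawn_runner is a hypothetical helper, not Tabbify's actual spawning code:

```python
import subprocess
from typing import List


def spawn_runner(runner_bin: str, app_uuid: str, no_mesh: bool = True) -> subprocess.Popen:
    """Hypothetical helper: launch one runner for one app, detached from the caller."""
    args: List[str] = [runner_bin, "--uuid", app_uuid]
    if no_mesh:
        args.append("--no-mesh")  # loopback-only local dev, as in the commands above
    # start_new_session=True makes the child call setsid() before exec, so the
    # runner leads its own session and process group. Signals aimed at the
    # supervisor's group (Ctrl-C, a group kill, terminal hangup) don't reach it.
    return subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        close_fds=True,
        start_new_session=True,
    )
```

Once the runner has its own session, the only durable link between it and the supervisor is the on-disk runner record, which is why re-adoption works from that file rather than from a parent/child relationship.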

A supervisor advertises only what it can run: it joins the mesh with the tag [supervisor] and adds firecracker if /dev/kvm is present. It may also carry a docker tag when a Docker daemon is reachable, and a builder tag when it is designated a build host. These are build-backend capabilities used by the deploy pipeline, not user-selectable runtimes; the app runtime is always Firecracker.
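The capability check above can be sketched as a small function. Only the /dev/kvm rule and the tag names come from the text; the function name, the tag order, and the docker_reachable / is_build_host inputs are assumptions for illustration:

```python
import os
from typing import Callable, List


def advertised_tags(
    kvm_path: str = "/dev/kvm",
    docker_reachable: Callable[[], bool] = lambda: False,
    is_build_host: bool = False,
) -> List[str]:
    """Tags a supervisor would join the mesh with (hypothetical helper)."""
    tags = ["supervisor"]  # every supervisor carries this
    if os.path.exists(kvm_path):
        tags.append("firecracker")  # the only app runtime; needs KVM
    # Build-backend capabilities for the deploy pipeline, not app runtimes.
    if docker_reachable():
        tags.append("docker")
    if is_build_host:
        tags.append("builder")
    return tags
```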

Lifecycle

You drive apps through the control API on the supervisor's mesh ULA (port 8730), or via the node:

POST /v1/apps/{uuid}/start   spawn runner (idempotent if already alive), wait up to 30s for health, return app_ula
POST /v1/apps/{uuid}/stop    shut down runner, forget its record, keep cached artifacts
POST /v1/apps/{uuid}/purge   shut down runner + clear cached OCI-image artifacts (rootfs)
GET  /v1/apps                list runner records on disk, each with a live health probe

start replies with the deterministic address and restart state:

{"state":"running","app_ula":"fd5a:1f02:44a5:240b:121a::1","restart_status":"running","restart_count":0}
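A minimal stdlib client for the lifecycle endpoints, assuming plain HTTP with JSON replies on the supervisor's control address. control_call and start_app are hypothetical names, and the 35s timeout is an assumption chosen to outlast the 30s health wait:

```python
import json
import urllib.request
from typing import Any


def control_call(base_url: str, method: str, path: str, timeout: float = 35.0) -> Any:
    """Issue one control-API request and decode its JSON reply."""
    data = b"" if method == "POST" else None  # empty body for POST actions
    req = urllib.request.Request(base_url + path, data=data, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def start_app(base_url: str, app_uuid: str) -> str:
    """Start (or confirm) an app and return its deterministic app_ula."""
    reply = control_call(base_url, "POST", f"/v1/apps/{app_uuid}/start")
    if reply.get("state") != "running":
        raise RuntimeError(f"app {app_uuid} not running: {reply}")
    return reply["app_ula"]
```

With a placeholder base URL such as "http://[<supervisor-ula>]:8730", start_app would return the app_ula field of a reply like the one above. stop and purge are the same POST with a different path, and the listing goes through control_call(base_url, "GET", "/v1/apps").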

Each runner record (RunnerHandle) is stored on disk as JSON: uuid, pid, control_sock, app_ula, and restart state. That file is the single source of truth, which is what makes re-adoption possible.
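The following sketch shows one way such a record could be persisted. It assumes one <uuid>.json file per runner in a state directory; the file layout and the restart_status / restart_count field names (borrowed from the start reply) are assumptions. Writing through a temp file plus os.replace keeps the source of truth from ever being half-written:

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RunnerHandle:
    uuid: str
    pid: int
    control_sock: str
    app_ula: str
    # Assumed names for the "restart state", mirroring the start reply.
    restart_status: str = "running"
    restart_count: int = 0


def save_record(state_dir: str, handle: RunnerHandle) -> str:
    """Atomically write <uuid>.json so readers never see a partial record."""
    final_path = os.path.join(state_dir, f"{handle.uuid}.json")
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(handle), f)
        f.flush()
        os.fsync(f.fileno())  # make the bytes durable before the rename
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path


def load_records(state_dir: str) -> List[RunnerHandle]:
    """Rebuild the runner table purely from disk, as a restarted supervisor would."""
    records = []
    for name in sorted(os.listdir(state_dir)):
        if name.endswith(".json"):
            with open(os.path.join(state_dir, name)) as f:
                records.append(RunnerHandle(**json.load(f)))
    return records
```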

Health and crash survival

Runners expose a Unix-socket control channel (JSON-lines, one command per connection):

-> {"cmd":"health"}
<- {"reply":"health","state":"running","app_ula":"fd5a:1f02:…::1","pid":12345,"app_health":"serving"}
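A client-side sketch of that exchange: connect, write one JSON line, read one reply line, and close. The helper name and the 2-second timeout are assumptions; the timeout is what turns a hung runner into a detectable error instead of a stuck supervisor:

```python
import json
import socket


def runner_health(control_sock: str, timeout: float = 2.0) -> dict:
    """Send {"cmd":"health"} as one JSON line and parse the single reply line."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)  # a hung runner raises socket.timeout instead of blocking
        s.connect(control_sock)
        s.sendall(json.dumps({"cmd": "health"}).encode("utf-8") + b"\n")
        buf = b""
        while b"\n" not in buf:
            chunk = s.recv(4096)
            if not chunk:  # runner closed the connection early
                break
            buf += chunk
    # One command per connection: the socket is closed once we leave the with-block.
    line = buf.split(b"\n", 1)[0]
    if not line:
        raise ConnectionError("runner closed the control socket without replying")
    return json.loads(line.decode("utf-8"))
```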

Because runners are detached, killing supervisord does not kill them; they keep serving. When supervisord restarts, it scans the on-disk records and probes each runner's pid and control socket. Live runners are re-adopted with no respawn blip; a record whose process is dead, or whose socket hangs, is force-killed (if still running) and respawned. From then on, a monitor loop reconciles every 5 seconds.
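The re-adoption decision can be sketched as a pure function over the on-disk records, with the liveness checks passed in. The function and action names are hypothetical; pid_alive uses the standard signal-0 existence check. A supervisor would run something like this at startup and then on each 5-second monitor tick:

```python
import os
from typing import Callable, Dict, Iterable, Mapping


def pid_alive(pid: int) -> bool:
    """Signal 0 checks that the process exists without actually signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def reconcile(
    records: Iterable[Mapping],
    is_alive: Callable[[int], bool],
    socket_ok: Callable[[str], bool],
) -> Dict[str, str]:
    """Map each on-disk record's uuid to the action a restarted supervisor takes."""
    plan: Dict[str, str] = {}
    for rec in records:
        alive = is_alive(rec["pid"])
        if alive and socket_ok(rec["control_sock"]):
            plan[rec["uuid"]] = "adopt"  # keep serving, no respawn blip
        elif alive:
            plan[rec["uuid"]] = "kill_and_respawn"  # process up but socket hung
        else:
            plan[rec["uuid"]] = "respawn"  # process already gone
    return plan
```

A bare pid check can be fooled by pid reuse after a crash, which is one reason to require a working control socket before adopting.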

Restart supervision

App-level (L2) restarts are the supervisor's job: each runner gets exponential backoff (base 10s, doubling per failure, capped at 300s, reset after 60s of stable running) with a crash-loop threshold of 5 failures. Keeping supervisord itself alive (L1) is the launch backend's job: a container --restart policy or a systemd unit. For example, with Docker:

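# /dev/kvm enables the firecracker tag; /dev/net/tun + NET_ADMIN let the mesh create its TUN device
docker run -d --restart on-failure \
  --device /dev/kvm --device /dev/net/tun --cap-add NET_ADMIN \
  -v tbf-state:/var/lib/tabbify tabbify-supervisor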
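The per-runner backoff from the restart-supervision paragraph can be written as a small class. The four constants come from the text. Two details are assumptions: that the first delay equals the base, and that "threshold of 5" means five or more counted failures. Under those assumptions the first six delays are 10, 20, 40, 80, 160 and 300 seconds:

```python
BASE_S = 10.0             # first delay
CAP_S = 300.0             # ceiling on any single delay
STABLE_RESET_S = 60.0     # uptime that clears the failure count
CRASH_LOOP_THRESHOLD = 5  # failures before a runner counts as crash-looping


class RestartBackoff:
    """Per-runner exponential backoff: delay_n = min(BASE_S * 2**(n-1), CAP_S)."""

    def __init__(self) -> None:
        self.failures = 0

    def on_failure(self) -> float:
        """Record one crash and return the seconds to wait before respawning."""
        self.failures += 1
        return min(BASE_S * 2 ** (self.failures - 1), CAP_S)

    def on_uptime(self, uptime_s: float) -> None:
        """Reset the count once the runner has stayed up for STABLE_RESET_S."""
        if uptime_s >= STABLE_RESET_S:
            self.failures = 0

    @property
    def crash_looping(self) -> bool:
        return self.failures >= CRASH_LOOP_THRESHOLD
```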

Warm start

Runners cache OCI-image artifacts on disk, so stopping and then starting an app skips the image fetch. Warm start also leans on Firecracker snapshot/restore: restoring a snapshot mmaps snap.mem (the guest RAM dump), so the microVM resumes instead of cold-booting. Note that sticky identity lives under SUPERVISOR_DATA_DIR: drop that volume and the app rejoins as a fresh peer with a new address.
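The restore step goes through Firecracker's API socket, which speaks HTTP over a Unix socket. The sketch below assumes that API. The PUT /snapshot/load body follows Firecracker's snapshot API; older releases used a mem_file_path field instead of mem_backend. The snap.vmstate file name and the helper names are hypothetical:

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that dials a Unix socket instead of TCP."""

    def __init__(self, sock_path: str, timeout: float = 10.0) -> None:
        super().__init__("localhost", timeout=timeout)  # host only fills the Host header
        self.sock_path = sock_path

    def connect(self) -> None:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(self.timeout)
        s.connect(self.sock_path)
        self.sock = s


def snapshot_load_body(vmstate_path: str, mem_path: str) -> dict:
    """Request body for PUT /snapshot/load; Firecracker mmaps mem_path on restore."""
    return {
        "snapshot_path": vmstate_path,
        "mem_backend": {"backend_path": mem_path, "backend_type": "File"},
        "resume_vm": True,  # resume right away instead of leaving the VM paused
    }


def restore_snapshot(api_sock: str, vmstate_path: str, mem_path: str) -> int:
    """Send the load request and return the HTTP status (204 on success)."""
    conn = UnixHTTPConnection(api_sock)
    try:
        conn.request(
            "PUT",
            "/snapshot/load",
            body=json.dumps(snapshot_load_body(vmstate_path, mem_path)),
            headers={"Content-Type": "application/json"},
        )
        resp = conn.getresponse()
        resp.read()  # drain any error body
        return resp.status
    finally:
        conn.close()
```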