Operations

Daily operating checks for the review service: health, queue, logs, metrics, dashboards, and context services.

Health endpoints

/health
Liveness. Use for simple process checks.
/ready
Readiness. Use for orchestration, because it does not report ready until the database is reachable and migrations have run.
/metrics
Prometheus metrics for queues, jobs, HTTP requests, uptime, and AI usage.
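
A deploy script can poll /ready before it sends traffic to a new instance. The following is a minimal sketch, not part of the shipped tooling: the function name wait_ready is hypothetical, and it assumes /ready answers with a non-2xx status (or refuses the connection) until the service is ready.

```bash
# Poll a readiness URL until it answers with HTTP 2xx or the attempts run out.
# Arguments: URL, max attempts (default 30), seconds between attempts (default 2).
# Assumption: the endpoint returns a non-2xx status while the service is not ready.
wait_ready() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; output and errors are discarded.
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

For example, wait_ready http://localhost:8787/ready 30 2 waits up to roughly a minute and exits non-zero if the service never becomes ready, which lets a calling script abort the rollout.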

Useful commands

docker compose ps
docker compose logs -f gittensory
curl http://localhost:8787/ready
curl http://localhost:8787/metrics

Important log events

selfhost_listening
selfhost_migrations_applied
selfhost_ai_provider
selfhost_ai_review_plan
selfhost_embed_provider
selfhost_vectorize
selfhost_job_dead
selfhost_cron_error
review_context_fetch_failed
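
To pull these events out of a busy log stream, a pair of small filters can help. This is a sketch under one assumption: the event name appears verbatim somewhere in each log line. The function names are hypothetical, and treating the last three events as the failure set is a reading of their names, not a documented classification.

```bash
# Keep only log lines that mention one of the event names listed above.
# Assumption: the event name appears verbatim in the log line.
filter_key_events() {
  grep -E 'selfhost_(listening|migrations_applied|ai_provider|ai_review_plan|embed_provider|vectorize|job_dead|cron_error)|review_context_fetch_failed'
}

# Narrower filter for the events whose names indicate a failure.
filter_failure_events() {
  grep -E 'selfhost_job_dead|selfhost_cron_error|review_context_fetch_failed'
}
```

Typical use is to pipe the service logs through a filter, for example docker compose logs --no-color gittensory | filter_failure_events. Both filters exit non-zero when nothing matches, which is standard grep behaviour.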

Observability profile

The observability profile starts Prometheus, Alertmanager, Loki, Promtail, and Grafana with dashboards for infra, review activity, and AI usage.

Postgres installs also expose database internals through the bundled Postgres exporter: connection pressure, lock waits, long transactions, deadlocks, database/table growth, dead tuples, autovacuum activity, and backup freshness. Backup freshness appears when the backup profile is active.

When OpenTelemetry and Sentry are enabled, job audit logs and Sentry events include trace_id/span_id fields so an operator can jump from a failed job or issue to the matching trace in Grafana or Tempo.
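
When working from raw logs, the trace IDs can be extracted and pasted into the trace search. The sketch below assumes the audit logs are JSON lines with a "trace_id":"<hex>" field. The exact log shape is not specified here, so adjust the pattern to match what your instance emits. The function name is hypothetical.

```bash
# Print each distinct trace_id found on stdin, one per line.
# Assumption: log lines are JSON and carry "trace_id":"<hex digits>".
extract_trace_ids() {
  sed -n 's/.*"trace_id"[[:space:]]*:[[:space:]]*"\([0-9a-fA-F]*\)".*/\1/p' | sort -u
}
```

For example, docker compose logs --no-color gittensory | grep selfhost_job_dead | extract_trace_ids lists the traces that belong to dead jobs.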

docker compose --profile postgres --profile observability up -d
docker compose --profile postgres --profile observability --profile backup up -d

Alerting — required for a 24/7 deployment

Alertmanager ships with a valid but silent default: every alert routes to a name-only receiver that discards it, so docker compose --profile observability up -d always starts clean even before you've configured anywhere to send notifications. This is intentional — the shipped config can't bake in a Slack/Discord/email destination that works for everyone — but it means nothing pages anyone until you edit alertmanager/alertmanager.yml yourself. Treat this as a required step, not an optional one, for any deployment you expect to run unattended.

The fastest verified path: create a Discord channel webhook (channel settings → Integrations → Webhooks → New Webhook), then uncomment the discord receiver block in alertmanager/alertmanager.yml and point the root route at it. Slack, email, and a generic webhook receiver (for PagerDuty or a custom handler) are also ready to uncomment in the same file.

Until you do, alerts are still visible without any extra setup: open Grafana and check the Alerts row on the main dashboard, which lists every currently firing alert directly from Prometheus, independent of Alertmanager routing. Use this as your fallback check if you haven't wired up push notifications yet; it shows the same signal that the "Dead jobs stay at zero" routine check below watches for.
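
Once a receiver is wired up, one way to confirm delivery end to end is to post a synthetic alert to Alertmanager's v2 API and check that it arrives in your channel. This is a sketch with assumptions: ALERTMANAGER_URL is not a shipped setting, the default of http://localhost:9093 only works if your compose setup publishes that port, and the alert name and labels are made up for the test.

```bash
# Build the JSON body for one synthetic alert in Alertmanager v2 API format.
# The alert name, severity label, and summary are invented for this test.
build_test_alert() {
  local name="${1:-SelfhostTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Test alert, safe to ignore"}}]\n' "$name"
}

# POST the synthetic alert to Alertmanager.
# ALERTMANAGER_URL is an assumption; point it at wherever Alertmanager is reachable.
send_test_alert() {
  local base="${ALERTMANAGER_URL:-http://localhost:9093}"
  build_test_alert "$@" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "$base/api/v2/alerts"
}
```

Run send_test_alert and watch the destination channel. If nothing arrives, check that the root route in alertmanager/alertmanager.yml points at the receiver you uncommented.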

Dead-lettered jobs also get one automatic revival attempt every 30 minutes (QUEUE_DEAD_LETTER_REVIVE_INTERVAL_MS), as long as the job has not already used up its small, bounded budget of extra revivals (QUEUE_DEAD_LETTER_AUTO_RETRY_MAX_EXTRA_ATTEMPTS, default 3). A job that died from a bug that has since been fixed and redeployed therefore recovers on its own within the next cycle, without anyone needing direct database access. A job that keeps failing the same way eventually exhausts this budget and stays dead, which is the state the alerting described above exists to surface.
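
As a rough guide to how long automatic revival can keep a failing job in play, multiply the revival interval by the extra-attempt budget. The sketch below assumes one revival per cycle and ignores the time each retry takes to run, so treat the result as an approximation. The function name is hypothetical, and the default of 1800000 ms is simply 30 minutes expressed in milliseconds.

```bash
# Approximate number of minutes before a persistently failing job exhausts
# its automatic revival budget.
# Arguments: revive interval in ms (default 1800000 = 30 minutes),
#            maximum extra attempts (default 3).
revival_window_minutes() {
  local interval_ms="${1:-1800000}"
  local max_extra="${2:-3}"
  echo $(( interval_ms * max_extra / 60000 ))
}
```

With the defaults this prints 90, so a job that is still dead about an hour and a half after it first died is unlikely to recover without intervention.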

Docker resource hygiene

Every service in docker-compose.yml caps its own container logs (10MB × 3 rotated files) out of the box, so log growth alone won't fill your disk. Unused Docker images and build cache are a separate, larger disk-growth vector on a host that rebuilds or pulls images repeatedly over months — Docker does not reclaim either automatically.

Install the provided host-level timer to reclaim both on a schedule (anything unused for less than 7 days is left alone, so a recent deploy is never at risk):

sudo cp systemd/gittensory-docker-prune.service.example /etc/systemd/system/gittensory-docker-prune.service
sudo cp systemd/gittensory-docker-prune.timer.example /etc/systemd/system/gittensory-docker-prune.timer
sudo $EDITOR /etc/systemd/system/gittensory-docker-prune.service   # set WorkingDirectory / ExecStart to your path
sudo systemctl daemon-reload
sudo systemctl enable --now gittensory-docker-prune.timer

Run it manually at any time with sh scripts/selfhost-docker-prune.sh, and run docker system df before and after to see what it reclaimed.

Routine cleanup should only ever prune containers, images, and build cache, never volumes. Pruning a volume deletes real application state (the database, backups, vector index, or a runner's registration and job data), not disposable build output, so leave volumes out of cleanup unless you intentionally want to delete that state.
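
To make the containers, images, and build cache rule concrete, here is an illustrative sketch of a volume-safe prune. It is not the shipped scripts/selfhost-docker-prune.sh, whose contents are not reproduced here, and the function names are hypothetical. It uses Docker's until filter with a 168h (7 day) threshold. For images that filter compares creation time, so it only approximates "unused for 7 days".

```bash
# Print the prune commands for a given age threshold (default 168h = 7 days).
# Illustration only; this is not the shipped prune script.
prune_plan() {
  local age="${1:-168h}"
  echo "docker container prune --force --filter until=$age"
  echo "docker image prune --all --force --filter until=$age"
  echo "docker builder prune --force --filter until=$age"
  # Deliberately absent: any command that prunes named data stores,
  # because those hold real application state.
}

# Execute the plan, showing disk usage before and after.
run_prune() {
  docker system df
  prune_plan "$@" | while read -r cmd; do
    # Word splitting of $cmd is intended; stdin is detached from the pipe.
    $cmd </dev/null
  done
  docker system df
}
```

Calling prune_plan on its own prints the commands without running anything, which is a convenient dry run before trusting run_prune on a production host.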

Self-hosted runner temp storage

If you run --profile runners, keep every runner job's scratch/temp writes on the mounted runner-work volume, never the container's plain /tmp. A container's own /tmp lives in Docker's overlay/containerd snapshot storage — a CI job that writes high-volume temp data there (language toolchain caches, build artifacts, ad hoc mktemp calls) grows the host's Docker root storage directly, not the volume, so it is invisible to volume-scoped cleanup and can fill the disk out from under the whole stack. The shipped runner service points TMPDIR, TMP, and TEMP at /tmp/runner/tmp (a subdirectory of the mounted runner-work volume) and keeps RUNNER_WORKDIR at /tmp/runner on the same volume. A one-shot runner-tmp-init service creates that directory on the volume (and makes it world-writable, matching real /tmp permissions) before the runner container starts, so this works out of the box on a fresh volume with no manual steps.
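
A quick way to confirm that a runner's temp variables really point at the mounted volume is to check each one against the work directory. The following is a sketch with hypothetical function names. It compares path strings only (it does not resolve symlinks or ".." segments) and assumes an unset variable falls back to /tmp.

```bash
# Succeed when dir is the mount point itself or a path beneath it.
# Pure string comparison: symlinks and ".." segments are not resolved.
path_on_volume() {
  local dir="${1%/}"
  local mount="${2%/}"
  case "$dir/" in
    "$mount"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether TMPDIR, TMP, and TEMP all live under RUNNER_WORKDIR.
# Assumption: an unset variable means tools fall back to plain /tmp.
check_runner_tmp() {
  local mount="${RUNNER_WORKDIR:-/tmp/runner}"
  local name value
  local status=0
  for name in TMPDIR TMP TEMP; do
    eval "value=\${$name:-/tmp}"
    if path_on_volume "$value" "$mount"; then
      echo "ok   $name=$value"
    else
      echo "BAD  $name=$value (not under $mount)"
      status=1
    fi
  done
  return "$status"
}
```

Run check_runner_tmp inside the runner container. A non-zero exit means at least one variable would send temp writes to the container's own filesystem.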

Adding a second or third runner service in docker-compose.override.yml for higher CI throughput? Each one needs its own runner-work-style volume, its own init step, and the same temp env — YAML anchors don't cross separate compose files, so repeat the extension block in your override file:

x-runner-tmp-env: &runner-tmp-env
  TMPDIR: /tmp/runner/tmp
  TMP: /tmp/runner/tmp
  TEMP: /tmp/runner/tmp

services:
  runner-2-tmp-init:
    image: alpine:3.20
    profiles: ["runners"]
    volumes:
      - runner-work-2:/tmp/runner
    command: ["sh", "-c", "mkdir -p /tmp/runner/tmp && chmod 1777 /tmp/runner/tmp"]

  runner-2:
    image: myoung34/github-runner:ubuntu-jammy
    profiles: ["runners"]
    depends_on:
      runner-2-tmp-init:
        condition: service_completed_successfully
    environment:
      <<: *runner-tmp-env
      RUNNER_NAME: gittensory-runner-2
      RUNNER_SCOPE: ${RUNNER_SCOPE:-repo}
      REPO_URL: ${RUNNER_REPO_URL:-}
      RUNNER_TOKEN: ${RUNNER_TOKEN:-}
      RUNNER_WORKDIR: /tmp/runner
    volumes:
      - runner-work-2:/tmp/runner

volumes:
  runner-work-2:

Sentry tracing

Leave SENTRY_TRACES_SAMPLE_RATE unset or blank to disable trace export, or set a positive sample rate such as 0.05 to send sampled review spans to Sentry. The custom OpenTelemetry provider installs Sentry hooks for review-stage spans carrying repo, PR, operation, outcome, and hashed installation tags.
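
A startup or CI check can catch a mistyped sample rate before it silently disables tracing. The sketch below encodes the rule as stated (blank or unset means off, a positive rate means on) plus two assumptions that are not confirmed by the runtime: a value of 0 or a non-numeric value counts as off, and rates above 1 are rejected because Sentry sample rates are fractions between 0 and 1. The function name is hypothetical.

```bash
# Succeed when the given sample rate would enable trace export.
# Assumptions: 0 and non-numeric values count as disabled; values above 1 are rejected.
traces_enabled() {
  local rate="$1"
  # Blank or unset: trace export is off.
  [ -n "$rate" ] || return 1
  # awk performs the floating-point comparison; non-numeric input evaluates to 0.
  awk -v r="$rate" 'BEGIN { exit !(r + 0 > 0 && r + 0 <= 1) }'
}
```

For example, traces_enabled "$SENTRY_TRACES_SAMPLE_RATE" && echo "tracing on" prints the message for 0.05 and stays silent for a blank value.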

Sentry cron monitors

When SENTRY_DSN is set, the self-host runtime emits Sentry monitor check-ins for the recurring loops where silent stoppage matters most. Leaving SENTRY_DSN unset keeps monitor reporting off.

scheduled loop
The two-minute maintenance tick that fans out sweeps, backfills, and refresh jobs.
Orb export
The hourly outcome export loop used by brokered self-host deployments.
Orb relay drain
The pull-mode relay loop for installations that receive events outbound from Orb.
Orb relay register
The recurring retry loop that (re-)registers this instance with the relay broker.

A missed monitor means the process may still be alive but the recurring work is not checking in on schedule. Pair the monitor with queue depth, dead-job counts, and the structured error log for the same subsystem.

Routine checks

  • Queue pending count is not growing without processing.
  • Dead jobs stay at zero or are investigated promptly.
  • Webhook deliveries are recent and have 2xx responses.
  • AI usage matches expected review volume and model/effort choices.
  • REES and RAG failures are visible and bounded.
  • Postgres connections, lock waits, slow transactions, dead tuples, and table growth are stable.
  • Backups are recent and restore-tested.

If an operating check fails, go to Self-host troubleshooting.