
Troubleshooting

Start with readiness and logs, then isolate webhook, queue, AI, REES, RAG, or write-suppression problems.

First checks

docker compose ps
docker compose logs --tail=200 gittensory
curl http://localhost:8787/ready
curl http://localhost:8787/metrics


No review appears

Webhook: Check GitHub App deliveries and confirm /v1/github/webhook receives 2xx responses.
Allowlist: Confirm the repo is in GITTENSORY_REVIEW_REPOS for per-PR features.
Write mode: SELFHOST_DEPLOYMENT_MODE=dry-run or disabled suppresses writes even when the review is computed.
Policy: gate.aiReview.mode=off or commentMode=off can make AI/comment output intentionally quiet.
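
The write-mode check above can be sketched as a small shell helper. This is a minimal sketch: the function name writes_suppressed is hypothetical, and it only inspects the variable in the current shell, so run it where the app's environment is visible (or pass the value as an argument).

```bash
# Hypothetical helper: succeeds (exit 0) when the deployment mode suppresses writes.
# Only the two values named in this guide are treated as suppressing.
writes_suppressed() {
  case "${1:-${SELFHOST_DEPLOYMENT_MODE:-}}" in
    dry-run|disabled) return 0 ;;
    *) return 1 ;;
  esac
}

# Report the mode seen by this shell.
if writes_suppressed; then
  echo "writes suppressed (mode=${SELFHOST_DEPLOYMENT_MODE})"
else
  echo "writes not suppressed by deployment mode"
fi
```

A "not suppressed" result only rules out the deployment mode; the policy settings listed above can still keep output quiet.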

AI summary unavailable

Confirm AI_PROVIDER is set and supported.
Confirm the provider key or local endpoint works from inside the container.
Set the matching provider model env, such as ANTHROPIC_AI_MODEL, OPENAI_COMPATIBLE_AI_MODEL, OLLAMA_AI_MODEL, CLAUDE_AI_MODEL, or CODEX_AI_MODEL.
Increase the matching provider timeout env, such as CLAUDE_AI_TIMEOUT_MS or CODEX_AI_TIMEOUT_MS, for large subscription-CLI reviews.
For CLI providers, confirm the CLI binary and credential path are available.
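
To see quickly which of the model variables named above are present, a sketch like the following may help. It covers only the five names this guide lists (other providers may use others), the function name report_model_envs is hypothetical, and it reflects the shell it runs in, so it is most useful when run inside the app container.

```bash
# Report which of the provider model variables named in this guide are set.
# Prints "set"/"unset" only, never the value itself.
report_model_envs() {
  for name in ANTHROPIC_AI_MODEL OPENAI_COMPATIBLE_AI_MODEL OLLAMA_AI_MODEL \
              CLAUDE_AI_MODEL CODEX_AI_MODEL; do
    eval "value=\${$name:-}"   # POSIX-safe indirect lookup of the variable named in $name
    if [ -n "$value" ]; then
      echo "$name set"
    else
      echo "$name unset"
    fi
  done
}

report_model_envs
```

The variable that matters is the one matching the configured AI_PROVIDER; the others can stay unset.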

REES is silent

A no-finding REES response can be intentionally invisible. For failures, search the logs for review_context_fetch_failed with contextType set to enrichment, or for rees_analyzer_config_invalid.

review_context_fetch_failed
rees_analyzer_config_invalid

See the REES enrichment page for enablement, and the REES analyzer reference for analyzer names, network calls, and token requirements.
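
The log search can be wrapped in a small filter. This is a sketch that assumes the event names appear verbatim in each log line; the exact log layout is not specified here, so it does not try to parse contextType, and the function name rees_failures is hypothetical.

```bash
# Keep only log lines that mention one of the two REES event names above.
# Reads log text on stdin and writes matching lines to stdout.
rees_failures() {
  grep -E 'review_context_fetch_failed|rees_analyzer_config_invalid'
}

# Example: docker compose logs --tail=1000 gittensory | rees_failures
```

An empty result is consistent with the no-finding case: REES ran and had nothing to report.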

RAG returns no context

Confirm GITTENSORY_REVIEW_RAG=true and that the repo is activated.
Confirm Qdrant or the vector backend is reachable from the app container.
Confirm the embedding endpoint and model are running.
Confirm the repo has been indexed after enabling the feature.

Queue stuck or dead jobs

Watch the pending, processed, failed, and dead metrics. A high pending count can come from webhook replay or maintenance work; dead jobs need direct investigation.

curl http://localhost:8787/metrics | grep gittensory_queue
docker compose logs gittensory | grep selfhost_job_dead


GitHub rate-limit responses or admission deferrals

Two independent signals cover this: gittensory_github_rest_rate_limit_responses_total counts actual 403/429 responses from GitHub, and the gittensory_jobs_rate_limit_admission_deferred_total / gittensory_jobs_rate_limit_budget_deferred_total / gittensory_jobs_rate_limited_by_type_total counters track jobs the queue itself held back before making a request, to avoid tripping a limit. All three job-side counters carry the same three labels — kind (webhook or background), key_scope (installation, public, global, or other), and job_type (the queue job's type, e.g. agent-regate-pr) — so you can break a spike down to exactly which token pool and which job type is under pressure.

A short burst of deferrals is expected and self-resolving: the queue deliberately trades a few seconds of delay to avoid a real 429. Treat deferrals as a problem only once they are sustained. The alerts are tuned accordingly: GittensoryGitHubRateLimitResponses fires on real 403/429 responses, and GittensoryQueueRateLimitDeferralsHigh fires on a sustained deferral rate rather than on every brief admission hold.

# Deferrals broken down by token pool and job type over the last 10m
sum by (key_scope, job_type) (rate(gittensory_jobs_rate_limit_admission_deferred_total[10m]))

# Is one key_scope (e.g. a single installation token) the bottleneck?
topk(5, sum by (key_scope) (rate(gittensory_jobs_rate_limit_budget_deferred_total[10m])))

# Real rate-limit responses from GitHub itself (not just internal deferrals)
sum(rate(gittensory_github_rest_rate_limit_responses_total[10m]))


If a single key_scope=installation pool is consistently the bottleneck, the fix is usually spreading load across more installation tokens (fewer repos per installation) or raising the GitHub App's own rate-limit tier, not code changes here.
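
Without a Prometheus server at hand, a rough per-pool breakdown can be read straight from the /metrics text. The sketch below sums one counter by its key_scope label; it assumes the standard Prometheus text exposition format with no trailing timestamps, and the function name sum_by_key_scope is hypothetical. The numbers are raw totals since process start, not rates, so compare two snapshots taken a few minutes apart.

```bash
# Sum one counter by its key_scope label from Prometheus text exposition on stdin.
# Usage: ... | sum_by_key_scope gittensory_jobs_rate_limit_admission_deferred_total
sum_by_key_scope() {
  awk -v metric="$1" '
    index($0, metric "{") == 1 {
      scope = "unknown"
      if (match($0, /key_scope="[^"]*"/)) {
        # Skip the 11-character prefix key_scope=" and drop the closing quote.
        scope = substr($0, RSTART + 11, RLENGTH - 12)
      }
      total[scope] += $NF
    }
    END { for (s in total) printf "%s %d\n", s, total[s] }
  ' | sort
}

# Example:
# curl -s http://localhost:8787/metrics | sum_by_key_scope gittensory_jobs_rate_limit_budget_deferred_total
```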

Low GitHub response-cache hit rate

gittensory_github_response_cache_total (REST) and gittensory_github_graphql_cache_total (GraphQL) both carry a result label — hit, miss, set, coalesced, bypassed, or error — and a class label identifying the endpoint family. A healthy cache should show most traffic as hit for endpoints that are read repeatedly in one review/maintenance pass (PR reads, check-run lookups); a low hit rate on those specific classes, not the overall average, is the useful signal.
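
The per-class ratio can also be approximated from a single /metrics scrape. This sketch assumes the standard Prometheus text exposition format with no trailing timestamps; the function name cache_hit_ratio is hypothetical. It divides hit by all results for each class, and because it uses totals since process start rather than a time window, it is a coarse first look rather than a replacement for a windowed query.

```bash
# Per-class hit ratio for one cache counter, from Prometheus text exposition on stdin.
# Usage: ... | cache_hit_ratio gittensory_github_response_cache_total
cache_hit_ratio() {
  awk -v metric="$1" '
    # Return the value of label `name` on the current line, or "unknown".
    function label(name,    pat) {
      pat = name "=\"[^\"]*\""
      if (!match($0, pat)) return "unknown"
      return substr($0, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
    }
    index($0, metric "{") == 1 {
      cls = label("class")
      all[cls] += $NF
      if (label("result") == "hit") hits[cls] += $NF
    }
    END {
      for (c in all) if (all[c] > 0) printf "%s %.2f\n", c, hits[c] / all[c]
    }
  ' | sort
}

# Example:
# curl -s http://localhost:8787/metrics | cache_hit_ratio gittensory_github_graphql_cache_total
```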

# REST hit rate by endpoint class over the last 15m
sum by (class) (rate(gittensory_github_response_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_response_cache_total[15m]))

# GraphQL hit rate — same shape, separate metric
sum by (class) (rate(gittensory_github_graphql_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_graphql_cache_total[15m]))


Qdrant / vector-store errors

gittensory_qdrant_errors_total carries an op label (upsert, query, or delete) so you can tell whether indexing or retrieval is failing. GittensoryQdrantErrorRateHigh fires on a sustained error ratio, not an isolated blip.

Confirm QDRANT_URL (e.g. http://qdrant:6333) is reachable from the app container and the qdrant Compose profile is running.
If Qdrant requires auth, confirm QDRANT_API_KEY is set and matches the Qdrant deployment's configuration.
A dimension-mismatch error means the existing gittensory collection (the fixed collection name self-host always uses) was created with a different embedding model than the one currently configured (AI_EMBED_MODEL). Recreating it — delete the collection and let the next index run recreate it at the current width — is the fix, but it temporarily removes ALL indexed RAG context for every repo until re-indexing completes, so treat it as a deliberate, disruptive step, not a routine one.
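
Before deleting anything, it may help to confirm the stored width. The sketch below reads Qdrant's collection-info JSON and compares its vector size with the width you expect. It assumes a single unnamed vector configuration (result.config.params.vectors.size) with size as the first key, and EXPECTED_DIM is a hypothetical variable you set to your embedding model's output width; the function names are hypothetical too.

```bash
# Extract the vector size from a Qdrant collection-info JSON document on stdin.
# Assumes a single unnamed vector config with "size" as the first key.
collection_vector_size() {
  tr -d ' \n' | sed -n 's/.*"vectors":{"size":\([0-9][0-9]*\).*/\1/p'
}

# check_dimension <expected_width>: succeeds only when the stored width matches.
check_dimension() {
  actual="$(collection_vector_size)"
  if [ "$actual" = "$1" ]; then
    echo "match: collection width $actual"
  else
    echo "mismatch: collection width ${actual:-unknown}, expected $1"
    return 1
  fi
}

# Example: curl -s "$QDRANT_URL/collections/gittensory" | check_dimension "$EXPECTED_DIM"
```

If jq is installed, reading .result.config.params.vectors.size with it is more robust than the sed pattern used here.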

curl "$QDRANT_URL/collections/gittensory"
docker compose --profile qdrant ps qdrant

# Only after confirming a dimension mismatch is the actual cause:
curl -X DELETE "$QDRANT_URL/collections/gittensory"


Orb export or relay problems

For brokered self-host deployments, gittensory_orb_events_exported_total and gittensory_orb_export_errors_total track the hourly outcome-export loop; GittensoryOrbExportErrorRateHigh fires on a sustained error ratio there. The pull-mode relay loop (for installations receiving events outbound from Orb) reports through gittensory_orb_relay_drains_total (result=events when it drained something, result=empty otherwise) and gittensory_orb_webhook_total (event + result labels) for what happened to each relayed event once enqueued locally.

If exports are failing but the relay itself looks healthy, the export loop's Sentry cron monitor (see Self-host operations) is the fastest way to confirm whether the loop is even running, before digging into the error counters.

AI provider circuit breaker keeps opening

Each AI provider (self-host AI_PROVIDER entries) has its own circuit breaker: after 3 consecutive failures it stops attempting real calls to that provider for 60 seconds, recorded as gittensory_ai_provider_circuit_open_total{provider="..."} (skipped calls) alongside gittensory_ai_provider_failures_total{provider="..."} (real failures). It self-heals automatically — there is no manual reset — but it will reopen immediately if the underlying problem is still there.

Search logs for circuit_open: provider "..." to confirm which provider tripped, and selfhost_ai_provider_failed_in_chain for the real error each failed attempt hit before the breaker opened.
A provider that keeps re-tripping after its cooldown almost always means a persistent problem, not a transient blip: an expired/invalid API key, a CLI binary missing from the image (see selfhost_ai_cli_missing at boot), or the endpoint being genuinely unreachable from the container.
GittensoryAiProviderCircuitOpen fires on any circuit-open event in a 15-minute window — a single trip during a real but brief outage is expected; a rule that keeps firing across multiple windows points at the persistent case above.
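
The breaker behaviour described above can be modelled in a few lines of shell. This is an illustrative model, not the project's implementation: it uses the 3-failure threshold and 60-second cooldown stated above, takes the current time as an argument so it can be exercised by hand, and assumes the failure count simply starts again after a trip (how the real breaker behaves right after the cooldown is not specified here).

```bash
# Illustrative model: 3 consecutive failures open the circuit for 60 seconds.
FAILURE_THRESHOLD=3
COOLDOWN_SECONDS=60
failures=0
open_until=0

# record_result <ok|fail> <now_epoch_seconds>
record_result() {
  if [ "$1" = "ok" ]; then
    failures=0          # any success resets the consecutive-failure count
    return 0
  fi
  failures=$((failures + 1))
  if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
    open_until=$(($2 + COOLDOWN_SECONDS))
    failures=0          # assumption: counting restarts after a trip
  fi
}

# circuit_open <now_epoch_seconds>: succeeds while calls would be skipped.
circuit_open() {
  [ "$1" -lt "$open_until" ]
}
```

In this model a provider with a persistent fault trips again shortly after each cooldown, which is the repeating pattern the alert guidance above describes.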

Grafana traces error or show no data

The trace path is app or smoke process → OTEL collector → Tempo → Grafana. Tempo is only started by the observability profile, and app traces are only emitted when OTEL_TRACES_EXPORTER includes otlp.

docker compose --profile observability ps tempo otel-collector grafana
docker compose logs --tail=80 tempo otel-collector grafana

# Send one synthetic span through the collector and read it back from Tempo.
npm run test:smoke:observability


If the smoke command fails at otel-collector:4318/v1/traces, the collector is not reachable from the app container.
If it pushes successfully but cannot read tempo:3200/api/traces/<trace_id>, Tempo is unhealthy, not ingesting, or not sharing the Compose network.
If the smoke command passes but Grafana Explore fails, check the Tempo data source URL. It should point at http://tempo:3200, not the OTLP ingest ports.
For a temporary live debugging run, set OTEL_TRACES_SAMPLER_ARG=1 so every root trace is sampled, then lower it again after diagnosis.
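
Since app traces depend on OTEL_TRACES_EXPORTER including otlp, a quick check of that variable can rule out the simplest cause. This sketch treats the value as a comma-separated list, as the OpenTelemetry environment-variable convention does; the function name traces_exported is hypothetical, and it reads the current shell's environment unless a value is passed in.

```bash
# Succeeds when the comma-separated exporter list contains "otlp".
# Defaults to the OTEL_TRACES_EXPORTER value in the current shell.
traces_exported() {
  case ",$(printf '%s' "${1:-${OTEL_TRACES_EXPORTER:-}}" | tr -d ' ')," in
    *,otlp,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: traces_exported && echo "otlp exporter enabled" || echo "otlp exporter not enabled"
```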

Readiness fails

DB: Check DATABASE_URL or DATABASE_PATH, volume permissions, Postgres reachability, and migrations.
Migrations: Read startup logs for migration errors before recreating volumes.
Dependencies: If Qdrant or Postgres profiles are enabled, confirm those services are healthy first.
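
After fixing a database or dependency problem, a small retry loop avoids polling by hand. This is a generic sketch: wait_until is a hypothetical helper, and the example assumes /ready returns a non-2xx status while the app is not ready, which is what makes curl -f fail.

```bash
# Retry a command until it succeeds or the attempts run out.
# wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait_until 30 2 curl -fsS http://localhost:8787/ready
```

If the loop times out, go back to the startup logs rather than raising the attempt count.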
