Troubleshooting

Start with readiness and logs, then isolate webhook, queue, AI, REES, RAG, or write-suppression problems.

First checks

docker compose ps
docker compose logs --tail=200 gittensory
curl http://localhost:8787/ready
curl http://localhost:8787/metrics

No review appears

Webhook
Check GitHub App deliveries and confirm /v1/github/webhook receives 2xx responses.
Allowlist
Confirm the repo is in GITTENSORY_REVIEW_REPOS for per-PR features.
Write mode
SELFHOST_DEPLOYMENT_MODE=dry-run or disabled suppresses writes even when review computes.
Policy
gate.aiReview.mode=off or commentMode=off can make AI/comment output intentionally quiet.
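The allowlist and write-mode checks above can be scripted against the container environment. This is a minimal sketch, not part of gittensory: `check_review_env` is a hypothetical helper, it assumes `env` is available inside the `gittensory` container, and it assumes `GITTENSORY_REVIEW_REPOS` is a comma-separated list of `owner/name` entries, which the text above does not confirm.

```shell
# Hypothetical helper: reads KEY=VALUE lines on stdin and reports whether
# writes are suppressed and whether the given repo appears in the allowlist.
check_review_env() {
  repo="$1"
  env_dump="$(cat)"
  mode="$(printf '%s\n' "$env_dump" | sed -n 's/^SELFHOST_DEPLOYMENT_MODE=//p')"
  # Assumption: comma-separated list; spaces are stripped before matching.
  repos="$(printf '%s\n' "$env_dump" | sed -n 's/^GITTENSORY_REVIEW_REPOS=//p' | tr -d ' ')"

  case "$mode" in
    dry-run|disabled) echo "writes suppressed: SELFHOST_DEPLOYMENT_MODE=$mode" ;;
    *) echo "mode is not dry-run or disabled: SELFHOST_DEPLOYMENT_MODE=${mode:-<unset>}" ;;
  esac

  case ",$repos," in
    *",$repo,"*) echo "allowlisted: $repo" ;;
    *) echo "not in GITTENSORY_REVIEW_REPOS: $repo" ;;
  esac
}
```

A possible invocation, with `acme/web` standing in for the real repository: `docker compose exec gittensory env | check_review_env acme/web`.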

AI summary unavailable

  • Confirm AI_PROVIDER is set and supported.
  • Confirm the provider key or local endpoint works from inside the container.
  • Set the matching provider model env, such as ANTHROPIC_AI_MODEL, OPENAI_COMPATIBLE_AI_MODEL, OLLAMA_AI_MODEL, CLAUDE_AI_MODEL, or CODEX_AI_MODEL.
  • Increase the matching provider timeout env, such as CLAUDE_AI_TIMEOUT_MS or CODEX_AI_TIMEOUT_MS, for large subscription-CLI reviews.
  • For CLI providers, confirm the CLI binary and credential path are available.

REES is silent

A no-finding REES response can be intentionally invisible. For failures, search the logs for review_context_fetch_failed with contextType set to enrichment, or for rees_analyzer_config_invalid.

review_context_fetch_failed
rees_analyzer_config_invalid

See the REES enrichment page for enablement, and the REES analyzer reference for analyzer names, network calls, and token requirements.
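The log search can be wrapped in a small filter. This is a sketch under assumptions: `rees_failures` is a hypothetical helper name, and it assumes each log event is written on a single line that contains both the event name and, for fetch failures, the word `enrichment`.

```shell
# Hypothetical helper: keeps only REES-related failure lines from stdin.
# Fetch failures are kept only when the same line mentions "enrichment".
rees_failures() {
  grep -E 'rees_analyzer_config_invalid|review_context_fetch_failed' \
    | grep -E 'rees_analyzer_config_invalid|enrichment'
}
```

For example: `docker compose logs gittensory | rees_failures | tail -n 20`. No output means neither event was logged in the captured window, which is consistent with an intentionally silent no-finding response.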

RAG returns no context

  • Confirm GITTENSORY_REVIEW_RAG=true and repo activation.
  • Confirm Qdrant or the vector backend is reachable from the app container.
  • Confirm the embedding endpoint and model are running.
  • Confirm the repo has been indexed after enabling the feature.

Queue stuck or dead jobs

Watch the pending, processed, failed, and dead metrics. A high pending count can simply reflect webhook replay or maintenance work and usually drains on its own; dead jobs do not recover and need direct investigation.

curl http://localhost:8787/metrics | grep gittensory_queue
docker compose logs gittensory | grep selfhost_job_dead

GitHub rate-limit responses or admission deferrals

Two independent signals cover this: gittensory_github_rest_rate_limit_responses_total counts actual 403/429 responses from GitHub, and the gittensory_jobs_rate_limit_admission_deferred_total / gittensory_jobs_rate_limit_budget_deferred_total / gittensory_jobs_rate_limited_by_type_total counters track jobs the queue itself held back before making a request, to avoid tripping a limit. All three job-side counters carry the same three labels — kind (webhook or background), key_scope (installation, public, global, or other), and job_type (the queue job's type, e.g. agent-regate-pr) — so you can break a spike down to exactly which token pool and which job type is under pressure.

A short burst of deferrals is expected and self-resolving: the queue deliberately trades a few seconds of delay to avoid a real 429. Treat deferrals as a problem only once they are sustained. The two alerts are tuned accordingly: GittensoryGitHubRateLimitResponses fires on real 403/429 responses from GitHub, and GittensoryQueueRateLimitDeferralsHigh fires on a sustained deferral rate rather than on every brief admission hold.

# Deferrals broken down by token pool and job type over the last 10m
sum by (key_scope, job_type) (rate(gittensory_jobs_rate_limit_admission_deferred_total[10m]))

# Is one key_scope (e.g. a single installation token) the bottleneck?
topk(5, sum by (key_scope) (rate(gittensory_jobs_rate_limit_budget_deferred_total[10m])))

# Real rate-limit responses from GitHub itself (not just internal deferrals)
sum(rate(gittensory_github_rest_rate_limit_responses_total[10m]))

If a single key_scope=installation pool is consistently the bottleneck, the fix is usually spreading load across more installation tokens (fewer repos per installation) or raising the GitHub App's own rate-limit tier, not code changes here.

Low GitHub response-cache hit rate

gittensory_github_response_cache_total (REST) and gittensory_github_graphql_cache_total (GraphQL) both carry a result label — hit, miss, set, coalesced, bypassed, or error — and a class label identifying the endpoint family. A healthy cache should show most traffic as hit for endpoints that are read repeatedly in one review/maintenance pass (PR reads, check-run lookups); a low hit rate on those specific classes, not the overall average, is the useful signal.

# REST hit rate by endpoint class over the last 15m
sum by (class) (rate(gittensory_github_response_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_response_cache_total[15m]))

# GraphQL hit rate — same shape, separate metric
sum by (class) (rate(gittensory_github_graphql_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_graphql_cache_total[15m]))
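Without a Prometheus server at hand, a rough hit ratio can be read straight from the /metrics endpoint. The sketch below assumes the standard Prometheus text format with no trailing timestamps and the class and result labels described above; `cache_hit_ratio` is an illustrative helper name, not part of gittensory. It sums raw counters, so it reports the ratio since process start rather than the 15-minute rate the queries compute.

```shell
# Hypothetical helper: reads Prometheus text exposition on stdin and prints
# "<class> <hit_ratio>" for each class of the given cache counter.
cache_hit_ratio() {
  metric="${1:-gittensory_github_response_cache_total}"
  awk -v metric="$metric" '
    # Only sample lines for the requested metric (skips # HELP / # TYPE).
    index($0, metric "{") == 1 {
      class = ""; result = ""
      if (match($0, /class="[^"]*"/))  class  = substr($0, RSTART + 7, RLENGTH - 8)
      if (match($0, /result="[^"]*"/)) result = substr($0, RSTART + 8, RLENGTH - 9)
      total[class] += $NF                      # every result counts toward the total
      if (result == "hit") hits[class] += $NF  # only hits count toward the numerator
    }
    END {
      for (c in total)
        if (total[c] > 0) printf "%s %.2f\n", c, hits[c] / total[c]
    }
  ' | sort
}
```

Possible usage: `curl -s http://localhost:8787/metrics | cache_hit_ratio` for REST, or pass `gittensory_github_graphql_cache_total` as the first argument for GraphQL. As with the queries, compare the classes that are read repeatedly rather than the overall average.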

Qdrant / vector-store errors

gittensory_qdrant_errors_total carries an op label (upsert, query, or delete) so you can tell whether indexing or retrieval is failing. GittensoryQdrantErrorRateHigh fires on a sustained error ratio, not an isolated blip.

  • Confirm QDRANT_URL (e.g. http://qdrant:6333) is reachable from the app container and the qdrant Compose profile is running.
  • If Qdrant requires auth, confirm QDRANT_API_KEY is set and matches the Qdrant deployment's configuration.
  • A dimension-mismatch error means the existing gittensory collection (the fixed collection name self-host always uses) was created with an embedding model whose vector width differs from the one currently configured (AI_EMBED_MODEL). The fix is to delete the collection and let the next index run recreate it at the current width. Doing so temporarily removes all indexed RAG context for every repo until re-indexing completes, so treat it as a deliberate, disruptive step, not a routine one.

curl "$QDRANT_URL/collections/gittensory"
docker compose --profile qdrant ps qdrant

# Only after confirming a dimension mismatch is the actual cause:
curl -X DELETE "$QDRANT_URL/collections/gittensory"
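Before deleting anything, it may help to compare the collection's stored vector width with the width the configured embedding model produces. This is a rough sketch: `check_dimension` and `collection_vector_size` are hypothetical helpers, the expected width (768 in the usage line) is something you must look up for your own model, and the text-based extraction assumes the first `"size"` key in the collection info response is the vector size, which holds for a single unnamed vector configuration but is not guaranteed for every setup.

```shell
# Hypothetical helper: pulls the first "size": <n> value out of the JSON
# returned by GET /collections/<name>, read from stdin.
collection_vector_size() {
  grep -oE '"size" *: *[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Hypothetical helper: compares the stored width with the expected one.
# Exit status: 0 on match, 1 on mismatch, 2 if no size could be read.
check_dimension() {
  expected="$1"
  actual="$(collection_vector_size)"
  if [ -z "$actual" ]; then
    echo "could not read vector size"
    return 2
  fi
  if [ "$actual" = "$expected" ]; then
    echo "match: $actual"
  else
    echo "mismatch: collection=$actual expected=$expected"
    return 1
  fi
}
```

Possible usage: `curl -s "$QDRANT_URL/collections/gittensory" | check_dimension 768`. If the deployment requires auth, add `-H "api-key: $QDRANT_API_KEY"` to the curl call. Only a reported mismatch justifies the DELETE above.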

Orb export or relay problems

For brokered self-host deployments, gittensory_orb_events_exported_total and gittensory_orb_export_errors_total track the hourly outcome-export loop; GittensoryOrbExportErrorRateHigh fires on a sustained error ratio there. The pull-mode relay loop (for installations receiving events outbound from Orb) reports through gittensory_orb_relay_drains_total (result=events when it drained something, result=empty otherwise) and gittensory_orb_webhook_total (event + result labels) for what happened to each relayed event once enqueued locally.

If exports are failing but the relay itself looks healthy, the export loop's Sentry cron monitor (see Self-host operations) is the fastest way to confirm whether the loop is even running, before digging into the error counters.

AI provider circuit breaker keeps opening

Each AI provider (self-host AI_PROVIDER entries) has its own circuit breaker: after 3 consecutive failures it stops attempting real calls to that provider for 60 seconds, recorded as gittensory_ai_provider_circuit_open_total{provider="..."} (skipped calls) alongside gittensory_ai_provider_failures_total{provider="..."} (real failures). It self-heals automatically — there is no manual reset — but it will reopen immediately if the underlying problem is still there.

  • Search logs for circuit_open: provider "..." to confirm which provider tripped, and selfhost_ai_provider_failed_in_chain for the real error each failed attempt hit before the breaker opened.
  • A provider that keeps re-tripping after its cooldown almost always means a persistent problem, not a transient blip: an expired/invalid API key, a CLI binary missing from the image (see selfhost_ai_cli_missing at boot), or the endpoint being genuinely unreachable from the container.
  • GittensoryAiProviderCircuitOpen fires on any circuit-open event in a 15-minute window — a single trip during a real but brief outage is expected; a rule that keeps firing across multiple windows points at the persistent case above.
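To see which provider is tripping most often, the circuit_open log lines can be tallied per provider. A minimal sketch, assuming the log message contains the literal text `circuit_open: provider "<name>"` (optionally with backslash-escaped quotes, as in JSON-formatted logs); `circuit_trips_by_provider` is a hypothetical helper name.

```shell
# Hypothetical helper: counts circuit_open log lines per provider on stdin,
# most frequent first. Accepts both plain and JSON-escaped quotes.
circuit_trips_by_provider() {
  grep -oE 'circuit_open: provider \\?"[^"\\]+' \
    | sed 's/.*"//' \
    | sort | uniq -c | sort -rn
}
```

Possible usage: `docker compose logs --since 1h gittensory | circuit_trips_by_provider`. A provider that dominates the tally across several windows matches the persistent case described above; follow up with the selfhost_ai_provider_failed_in_chain lines for the underlying error.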

Grafana traces error or show no data

The trace path is app or smoke process → OTEL collector → Tempo → Grafana. Tempo is only started by the observability profile, and app traces are only emitted when OTEL_TRACES_EXPORTER includes otlp.

docker compose --profile observability ps tempo otel-collector grafana
docker compose logs --tail=80 tempo otel-collector grafana

# Send one synthetic span through the collector and read it back from Tempo.
npm run test:smoke:observability

  • If the smoke command fails at otel-collector:4318/v1/traces, the collector is not reachable from the app container.
  • If it pushes successfully but cannot read tempo:3200/api/traces/<trace_id>, Tempo is unhealthy, not ingesting, or not sharing the Compose network.
  • If the smoke command passes but Grafana Explore fails, check the Tempo data source URL. It should point at http://tempo:3200, not the OTLP ingest ports.
  • For a temporary live debugging run, set OTEL_TRACES_SAMPLER_ARG=1 so every root trace is sampled, then lower it again after diagnosis.

Readiness fails

DB
Check DATABASE_URL or DATABASE_PATH, volume permissions, Postgres reachability, and migrations.
Migrations
Read startup logs for migration errors before recreating volumes.
Dependencies
If Qdrant or Postgres profiles are enabled, confirm those services are healthy first.
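After fixing a dependency, polling the readiness endpoint avoids guessing when the app has recovered. This is a sketch, not a shipped script: `wait_ready` is a hypothetical helper, and it assumes /ready answers with a 2xx status once the app is ready, which the text above does not state explicitly.

```shell
# Hypothetical helper: polls a readiness URL until it returns 2xx.
# Arguments: URL, max attempts, delay in seconds between attempts.
wait_ready() {
  url="${1:-http://localhost:8787/ready}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints the HTTP status; connection failures are reported as 000.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url")" || code="000"
    case "$code" in
      2*) echo "ready after $i attempt(s)"; return 0 ;;
    esac
    echo "attempt $i/$attempts: HTTP $code" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

Possible usage: `wait_ready http://localhost:8787/ready 30 2 || docker compose logs --tail=200 gittensory`, which prints recent logs only if the app never becomes ready.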