# Troubleshooting
This page covers the most common failure modes, their symptoms, and how to resolve them.
## Workers not scaling

**Symptom:** The `job_queue` table has unclaimed rows, but no worker pods are spawning. `kubectl get pods -n snowpack` shows no worker pods.

**Cause:** The KEDA ScaledJob is not triggering. This usually means the `postgresql` trigger cannot connect to the database, or the trigger query is not returning the expected result.
**Diagnosis:**

- Check the ScaledJob status:

  ```sh
  kubectl get scaledjob -n snowpack
  ```

  Look at the `READY` column. If it shows `False`, KEDA cannot evaluate the trigger.

- Check KEDA operator logs for connection errors:

  ```sh
  kubectl logs -n keda -l app=keda-operator --tail=50
  ```

- Verify the `job_queue` has unclaimed work:

  ```sql
  SELECT COUNT(*) FROM job_queue
  WHERE claimed_at IS NULL AND visible_at <= NOW();
  ```

- Verify `activationTargetQueryValue` is set in the ScaledJob trigger metadata. KEDA 2.12+ requires `activationTargetQueryValue` (not `activationLagCount`) to activate from zero replicas. Without it, KEDA will not scale up from zero even when there is work in the queue.
**Resolution:** Fix the KEDA trigger authentication (check the Secret referenced by the `TriggerAuthentication`), verify Postgres connectivity from the KEDA namespace, and confirm `activationTargetQueryValue` is present.
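To make the moving parts concrete, here is a hedged sketch of the trigger portion of a ScaledJob manifest. The resource name, the auth reference, and the threshold values are illustrative assumptions, not taken from this deployment; only the query mirrors the unclaimed-work check above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: snowpack-worker   # illustrative name
  namespace: snowpack
spec:
  pollingInterval: 30
  triggers:
    - type: postgresql
      metadata:
        query: "SELECT COUNT(*) FROM job_queue WHERE claimed_at IS NULL AND visible_at <= NOW();"
        targetQueryValue: "1"
        # Required on KEDA 2.12+ to activate from zero replicas:
        activationTargetQueryValue: "0"
      authenticationRef:
        name: snowpack-postgres-auth   # illustrative TriggerAuthentication name
```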
## Jobs stuck in pending

**Symptom:** Jobs show `status: pending` for longer than expected. Workers may or may not be running.
**Cause:** Several possible causes:

- KEDA's polling interval is 30 seconds, so there is an inherent delay between a job being queued and a worker pod starting.
- The `visible_at` timestamp on the queue row may be in the future (retry backoff).
- A stale claim from a crashed worker may be blocking the row. The `reclaim_stale` sweeper releases claims older than 30 minutes, but this requires the API process to be running.
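The retry-backoff case is the least obvious one. The exact schedule Snowpack uses is not documented here; as a hedged illustration only, assuming a capped exponential backoff, `visible_at` would be pushed into the future like this:

```python
from datetime import datetime, timedelta, timezone

def next_visible_at(attempt: int, base_delay_s: int = 60, cap_s: int = 3600) -> datetime:
    # Hypothetical schedule: delay doubles with each failed attempt, capped.
    # A job on its third retry can legitimately sit "pending" for minutes,
    # invisible to workers until visible_at passes.
    delay_s = min(base_delay_s * (2 ** attempt), cap_s)
    return datetime.now(timezone.utc) + timedelta(seconds=delay_s)
```

A row whose `visible_at` is in the future is skipped by the unclaimed-work query, so the job looks stuck even though the system is behaving as designed.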
**Diagnosis:**

- Check queue row timestamps:

  ```sql
  SELECT job_id, visible_at, claimed_at
  FROM job_queue
  WHERE claimed_at IS NULL
  ORDER BY visible_at;
  ```

- Check for stale claims (claimed but not progressing):

  ```sql
  SELECT job_id, claimed_at
  FROM job_queue
  WHERE claimed_at IS NOT NULL
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  ```

- Verify the API is running (the `reclaim_stale` sweeper runs inside the API process):

  ```sh
  kubectl get pods -n snowpack -l app.kubernetes.io/component=api
  ```
**Resolution:** If stale claims exist and the API is running, the sweeper will reclaim them within 30 seconds. If the API is not running, fix the API first; the sweeper cannot run without it. For jobs stuck behind a future `visible_at`, wait for the backoff window to expire.
## Health sync OOM

**Symptom:** The health-sync CronJob pod is OOMKilled. `kubectl describe pod` shows the container exceeded its memory limit.

**Cause:** PyIceberg loads table metadata into memory. With high concurrency and many large tables, the combined memory footprint exceeds the pod's limit. This was tracked in DL-278.
**Diagnosis:**

```sh
kubectl get pods -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
kubectl describe pod <oom-killed-pod> -n snowpack
```

Look for `Last State: Terminated` with `Reason: OOMKilled` and check the memory limit in the container spec.
**Resolution:** Reduce the `SNOWPACK_HEALTH_SYNC_CONCURRENCY` setting. For the dev environment the concurrency is set to `2` (down from the default `10`). In the Helm values:

```yaml
healthSync:
  concurrency: 2
  resources:
    limits:
      memory: 768Mi
```

If the problem persists even at concurrency 2, increase the memory limit rather than lowering concurrency further; at concurrency 1 the sync window may exceed the 15-minute CronJob interval.
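The concurrency-versus-interval trade-off above can be sketched with a rough back-of-the-envelope check. The per-table timing is an assumption for illustration, not a measured figure:

```python
import math

def sync_fits_interval(table_count: int, avg_s_per_table: float,
                       concurrency: int, interval_s: int = 15 * 60) -> bool:
    # With `concurrency` tables processed in parallel, a full pass takes
    # roughly ceil(table_count / concurrency) * avg_s_per_table seconds.
    full_pass_s = math.ceil(table_count / concurrency) * avg_s_per_table
    return full_pass_s <= interval_s
```

For example, with 142 tables at a hypothetical 10 s each, concurrency 2 fits the 15-minute window (~710 s) but concurrency 1 does not (~1420 s), which is why the resolution favors raising the memory limit over dropping to concurrency 1.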
## Table not appearing in orchestrator

**Symptom:** A table has `snowpack.maintenance_enabled = true` set as a table property, but the orchestrator never submits maintenance for it.
**Cause:** The orchestrator only processes tables that satisfy all three conditions:

- The table's database is listed in `healthSync.databases`.
- The table has `snowpack.maintenance_enabled = true` as a table property.
- The table's database is listed in `orchestrator.includeDatabases`.

If any condition is not met, the orchestrator skips the table silently.
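The three conditions compose as a single all-or-nothing predicate. A minimal sketch (the function name is ours; the property key and config lists are from the conditions above):

```python
def orchestrator_considers(database: str, table_properties: dict,
                           health_sync_dbs: set, orchestrator_dbs: set) -> bool:
    # All three opt-in conditions must hold; a miss on any one is a silent skip.
    return (
        database in health_sync_dbs
        and table_properties.get("snowpack.maintenance_enabled") == "true"
        and database in orchestrator_dbs
    )
```

Note that a table can pass health-sync (first two conditions) and still be skipped because its database is missing from the orchestrator list, which is the case the diagnosis steps below are designed to catch.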
**Diagnosis:**

- Verify the table appears in the table cache:

  ```sh
  curl -s 'https://<snowpack-host>/tables?database=<database>&maintenance_enabled=true' | jq .
  ```

  (The URL is quoted so the shell does not interpret the `&`.) If the table is not in the response, health-sync has not picked it up.

- Check `healthSync.databases` in the Helm values:

  ```sh
  helm get values snowpack -n snowpack | grep -A5 healthSync
  ```

  The table's database must be in this list.

- Check `orchestrator.includeDatabases`:

  ```sh
  helm get values snowpack -n snowpack | grep -A5 orchestrator
  ```

  `orchestrator.includeDatabases` must be a subset of `healthSync.databases`. If a database is in the orchestrator list but not in the health-sync list, the opt-in flags on new tables will never reach the table cache and the orchestrator will skip them.
**Resolution:** Add the database to both `healthSync.databases` and `orchestrator.includeDatabases` in the Helm values, then `terraform apply`.
## 409 Conflict on maintenance submit

**Symptom:** `POST /tables/{db}/{table}/maintenance` returns `409 Conflict` with the message "Maintenance already in progress for {db}.{table}".
**Cause:** Another job currently holds the lock for this table. Snowpack uses a `table_locks` table to ensure only one maintenance job runs per table at a time. The lock is acquired when a job is submitted and released when the job completes, fails, or is cancelled.
**Diagnosis:**

- Check who holds the lock:

  ```sql
  SELECT table_key, holder, acquired_at, expires_at
  FROM table_locks
  WHERE table_key = '<database>.<table>';
  ```

- Check the status of the holding job:

  ```sh
  curl -s https://<snowpack-host>/jobs/<holder-job-id> | jq .status
  ```
**Resolution:** If the holding job is still running, wait for it to complete. If the holding job has already finished but the lock was not released (crash during cleanup), the `reclaim_stale` sweeper will release it once the lock's `expires_at` has passed. To release a stale lock immediately, cancel the holding job via `POST /jobs/{id}/cancel`.
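The decision procedure above can be sketched as a small lookup. The status strings are assumptions for illustration; check the actual `/jobs/{id}` response for the real values your deployment returns:

```python
def stale_lock_action(holder_status: str, lock_expired: bool) -> str:
    # Assumed status names: "running", "succeeded", "failed", "cancelled".
    if holder_status == "running":
        return "wait"                  # holder is legitimately working
    if lock_expired:
        return "sweeper-will-release"  # reclaim_stale frees it after expires_at
    return "cancel-holder"             # POST /jobs/{id}/cancel releases the lock now
```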
## Stale table cache

**Symptom:** The API returns outdated table lists, or newly opted-in tables do not appear in API responses.

**Cause:** The table cache is populated by the health-sync CronJob, which runs every 15 minutes. If the CronJob has not completed recently, the cache may be stale.
**Diagnosis:**

Check the cache status endpoint for the last sync timestamp:

```sh
curl -s https://<snowpack-host>/tables/cache-status | jq .
```

The response includes:

```json
{
  "last_synced": "2026-04-25T12:15:00+00:00",
  "table_count": 142
}
```

If `last_synced` is more than 15-20 minutes old, the health-sync CronJob may be failing.
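The staleness check lends itself to scripting, e.g. for a monitoring probe. A minimal sketch that parses the response body shown above (the 20-minute threshold is our assumption, matching the 15-20 minute guidance):

```python
import json
from datetime import datetime, timedelta, timezone

def cache_is_stale(cache_status_body: str, now: datetime,
                   max_age_min: int = 20) -> bool:
    # Compare the last_synced timestamp from /tables/cache-status to now.
    last_synced = datetime.fromisoformat(json.loads(cache_status_body)["last_synced"])
    return now - last_synced > timedelta(minutes=max_age_min)
```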
**Resolution:**

- Check health-sync CronJob history:

  ```sh
  kubectl get cronjob -n snowpack -l app.kubernetes.io/component=health-sync
  kubectl get jobs -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
  ```

- If the CronJob is failing, check pod logs for the most recent failed run:

  ```sh
  kubectl logs -n snowpack -l app.kubernetes.io/component=health-sync --tail=100
  ```

- Common causes include OOM kills (see Health sync OOM above), Glue API throttling, or Postgres connection failures. Fix the underlying issue and the next CronJob run will repopulate the cache.