Key Concepts
Maintenance actions
Snowpack supports five maintenance actions, always executed in this order:
1. rewrite_data_files — Compacts small data files into fewer, optimally sized files. This is the most impactful action for query performance.
2. rewrite_position_delete_files — Merges position-delete files back into their corresponding data files, eliminating the read-time overhead of applying deletes.
3. rewrite_manifests — Consolidates manifest files to reduce planning time for queries that scan large tables.
4. expire_snapshots — Removes snapshots older than the retention threshold, freeing the metadata layer from tracking stale table states.
5. remove_orphan_files — Deletes data files on storage that are no longer referenced by any active snapshot.
The ordering matters: the rewrite actions run first, and expire_snapshots runs before remove_orphan_files because orphan detection relies on stale snapshots having already been expired. Running orphan removal first would miss files that are still referenced by soon-to-expire snapshots, leaving them on storage until a later run.
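A maintenance request can name any subset of these actions, but they always execute in the canonical order above. A minimal sketch of that normalization step (the helper function is hypothetical; only the action names come from this doc):

```python
# Canonical execution order for Snowpack maintenance actions (see the list above).
ACTION_ORDER = [
    "rewrite_data_files",
    "rewrite_position_delete_files",
    "rewrite_manifests",
    "expire_snapshots",
    "remove_orphan_files",
]

def order_actions(requested):
    """Return the requested actions sorted into canonical execution order.

    Hypothetical helper: rejects unknown action names so typos fail fast
    instead of being silently dropped.
    """
    unknown = set(requested) - set(ACTION_ORDER)
    if unknown:
        raise ValueError(f"unknown maintenance actions: {sorted(unknown)}")
    wanted = set(requested)
    return [action for action in ACTION_ORDER if action in wanted]
```

Sorting against a fixed reference list (rather than trusting request order) is what guarantees, for example, that expire_snapshots always precedes remove_orphan_files.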
Health analysis
Snowpack evaluates table health by inspecting Iceberg metadata for four key metrics:
- Small file count — Number of data files below the target file size.
- Snapshot count — Total snapshots retained by the table.
- Manifest count — Number of manifest files in the current metadata.
- Position delete files — Count of outstanding position-delete files.
Each metric is compared against configurable thresholds. When any metric exceeds
its threshold, the table is flagged as needs_maintenance. Health data is
available in two flavors:
- Live — Fetched directly from the PyIceberg catalog (Glue/S3). Accurate but takes a few seconds per table.
- Cached — Served from Postgres. Returns in roughly 1 ms, refreshed periodically by the health-sync process.
Opt-in model
Snowpack does not maintain tables by default. Onboarding is a two-step process:
1. Set the table property. A data engineer runs:

   ```sql
   ALTER TABLE lakehouse_dev.my_database.my_table
   SET TBLPROPERTIES ('snowpack.maintenance_enabled' = 'true');
   ```

   This signals that the table owner wants Snowpack to manage it.

2. Add the database to the orchestrator allowlist. A platform engineer adds the database name to the orchestrator.allowedDatabases list in the Helm values file. This controls which databases the automated CronJob is permitted to process.
Both conditions must be met before the orchestrator will schedule maintenance for a table. This ensures that neither a table owner nor a platform operator can unilaterally enable maintenance — both must agree.
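The two-party check can be sketched as a single predicate. The property key and Helm list name come from this doc; the helper itself is hypothetical:

```python
def is_eligible(table_properties: dict, database: str, allowed_databases: list) -> bool:
    """Both opt-in conditions must hold before the orchestrator schedules maintenance.

    - table owner consent: the 'snowpack.maintenance_enabled' table property
    - platform consent: the database appears in orchestrator.allowedDatabases
    """
    opted_in = table_properties.get("snowpack.maintenance_enabled") == "true"
    allowlisted = database in allowed_databases
    return opted_in and allowlisted
```

Because the result is a conjunction, neither party can enable maintenance alone: flipping the table property does nothing for a database outside the allowlist, and vice versa.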
Job lifecycle
All maintenance operations in Snowpack are asynchronous. A job moves through these states:
- Pending — The job has been accepted and queued for execution.
- Running — Spark is actively executing the maintenance actions.
- Completed — All requested actions finished successfully.
- Failed — One or more actions encountered an error. Partial results may exist.
- Cancelled — The job was cancelled before completion.
The typical flow: submit a maintenance request via POST and receive a
202 Accepted response with a job ID. Then poll GET /jobs/{id} to track
progress. The orchestrator CronJob follows this same lifecycle automatically for
all opted-in tables.
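The submit-then-poll pattern can be sketched as a small loop. The GET /jobs/{id} endpoint and the state names come from this doc; the injected fetch_status callable stands in for the actual HTTP request so the loop stays self-contained:

```python
import time

# Terminal states from the lifecycle above: once reached, the job will not change.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_status, job_id, poll_interval=5.0, timeout=3600.0):
    """Poll a job until it reaches a terminal state or the timeout expires.

    fetch_status(job_id) -> str is injected (in practice it would issue
    GET /jobs/{id} and read the status field from the response).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still '{status}' after {timeout}s")
        time.sleep(poll_interval)
```

Note that "failed" is terminal but may leave partial results behind, so a caller should inspect the job's action-level results rather than assuming nothing was changed.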