Maintenance Actions
Snowpack runs five Iceberg maintenance operations. They always execute in the fixed order described below. This ordering is not arbitrary — each step depends on the outcomes of the previous steps.
Execution order
1. rewrite_data_files
Trigger: The table has small data files at or above the configured threshold
(snowpack.min_input_files).
Compacts small data files into fewer, optimally sized files using the binpack
strategy. This is the highest-impact action — most query performance
degradation comes from excessive small files forcing per-file overhead during
reads.
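The byte values passed to the call below are plain MiB-to-byte conversions; a quick sketch of the arithmetic (the 512 MiB / 384 MiB figures mirror the values in the call, and 384 is simply 75% of 512):

```python
# Helper converting MiB to the raw byte values Iceberg procedure options expect.
def mib(n: int) -> int:
    return n * 1024 * 1024

# 512 MiB target file size; files smaller than 384 MiB become compaction input.
target_file_size_bytes = mib(512)  # 536870912
min_file_size_bytes = mib(384)     # 402653184

print(target_file_size_bytes, min_file_size_bytes)  # 536870912 402653184
```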
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_DATA_FILES"} */
CALL lakehouse_dev.system.rewrite_data_files(
  table => 'my_database.my_table',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '536870912',
    'min-file-size-bytes', '402653184',
    'min-input-files', '5',
    'max-concurrent-file-group-rewrites', '50'
  )
)

2. rewrite_position_delete_files
Trigger: The table has position-delete files above the configured threshold.
Merges position-delete files back into their corresponding data files. Position deletes add read-time overhead because the engine must apply the recorded deletes on every scan of the affected data files. Rewriting eliminates this cost.
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_POSITION_DELETE_FILES"} */
CALL lakehouse_dev.system.rewrite_position_delete_files(
  table => 'my_database.my_table'
)

3. rewrite_manifests
Trigger: The table has manifest files above the configured threshold.
Consolidates manifest files to reduce query planning time. Large tables can accumulate hundreds of manifests, each of which must be read during planning. Rewriting reduces the number of manifests the engine needs to open.
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_MANIFESTS"} */
CALL lakehouse_dev.system.rewrite_manifests(
  table => 'my_database.my_table'
)

4. expire_snapshots
Trigger: The table has snapshots older than snowpack.max_snapshot_age_days,
or this action is run after compaction to clean up the snapshots created by the
rewrite steps above.
Removes snapshots that are older than the retention threshold. Each expired snapshot’s metadata is dropped, and the files it exclusively references become eligible for orphan file removal in the next step.
/* {"app":"snowpack","table":"my_database.my_table","action":"EXPIRE_SNAPSHOTS"} */
CALL lakehouse_dev.system.expire_snapshots(
  table => 'my_database.my_table',
  older_than => TIMESTAMP '2026-04-20 00:00:00',
  retain_last => 1
)

5. remove_orphan_files
Trigger: Always runs after snapshot expiration. This is a cleanup step, not independently triggered.
Deletes data files on S3 that are no longer referenced by any active snapshot. These orphan files are a byproduct of compaction and snapshot expiration — the old small files have been replaced by compacted files, and once the snapshots referencing them are expired, the physical files can be safely removed.
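The older_than cutoff for orphan removal is typically computed as the current time minus a safety window, so files written by in-flight jobs are never mistaken for orphans. A minimal sketch; the 3-day window here is an assumption for illustration, not a documented Snowpack default:

```python
from datetime import datetime, timedelta, timezone

def orphan_cutoff(now: datetime, safety_window: timedelta) -> str:
    """Format the older_than cutoff as the TIMESTAMP literal the procedure expects."""
    return (now - safety_window).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical run time; a 3-day window yields the cutoff passed to the call.
now = datetime(2026, 4, 25, 0, 0, 0, tzinfo=timezone.utc)
print(orphan_cutoff(now, timedelta(days=3)))  # 2026-04-22 00:00:00
```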
/* {"app":"snowpack","table":"my_database.my_table","action":"REMOVE_ORPHAN_FILES"} */
CALL lakehouse_dev.system.remove_orphan_files(
  table => 'my_database.my_table',
  older_than => TIMESTAMP '2026-04-22 00:00:00'
)

Why the order matters
The five actions form a pipeline where each step’s output feeds the next:
- Rewrite first, clean up second. Compaction (rewrite_data_files, rewrite_position_delete_files, rewrite_manifests) creates new snapshots and leaves the old data files in place. The old files are still referenced by existing snapshots, so they cannot be deleted yet.
- Expire before removing orphans. expire_snapshots drops the old snapshots that still reference the pre-compaction files. Only after expiration do those files become orphans.
- Orphan removal last. remove_orphan_files scans for files on S3 that no active snapshot references. If it ran before snapshot expiration, it would miss files still held by soon-to-expire snapshots, leaving storage waste behind. Running it last ensures a clean sweep.
Reversing or interleaving these steps would either skip reclaimable storage or, worse, delete files that are still needed by active snapshots.
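The dependency chain above can be sketched as a simple driver loop. The action names are the real ones; the trigger set, runner callback, and overall function are hypothetical stand-ins for Snowpack's internals:

```python
# Fixed maintenance pipeline. Order is load-bearing: compaction creates new
# snapshots, expiration releases the old ones, orphan removal reclaims storage.
REWRITE_ACTIONS = [
    "rewrite_data_files",
    "rewrite_position_delete_files",
    "rewrite_manifests",
]

def run_maintenance(table: str, triggered: set[str], run_action) -> list[str]:
    """Run triggered actions in pipeline order; cleanup steps follow compaction."""
    executed = []
    for action in REWRITE_ACTIONS:
        if action in triggered:
            run_action(table, action)
            executed.append(action)
    # expire_snapshots: age-triggered, or run to clean up after any compaction.
    if "expire_snapshots" in triggered or executed:
        run_action(table, "expire_snapshots")
        executed.append("expire_snapshots")
    # remove_orphan_files always follows snapshot expiration, never runs alone.
    if "expire_snapshots" in executed:
        run_action(table, "remove_orphan_files")
        executed.append("remove_orphan_files")
    return executed

# Example: only small-file compaction is triggered; cleanup still follows it.
ran = run_maintenance("my_database.my_table", {"rewrite_data_files"}, lambda t, a: None)
print(ran)  # ['rewrite_data_files', 'expire_snapshots', 'remove_orphan_files']
```

Note that reordering the checks would break the invariant the document describes: orphan removal only sees reclaimable files once expiration has released them.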
Spark cost tagging
Every SQL call is prefixed with a JSON comment consumed by the data platform’s Spark cost tracker:
/* {"app":"snowpack","table":"<db>.<tbl>","action":"<ACTION>"} */

This comment is parsed by the cost_engine.parse_statement_metadata function in
the data-platform-core-infra repository’s Glue ETL pipeline. It enables
per-table cost attribution for all Snowpack-initiated Spark work. If the comment
shape changes, the parser in that repository must be updated in lockstep.