Maintenance Actions
Snowpack runs five Iceberg maintenance operations. They always execute in the fixed order described below. This ordering is not arbitrary — each step depends on the outcomes of the previous steps.
Execution order
1. rewrite_data_files
Trigger: The table has small data files at or above the configured threshold
(snowpack.min_input_files).
Compacts small data files into fewer, optimally sized files using the binpack
strategy. This is the highest-impact action — most query performance
degradation comes from excessive small files forcing per-file overhead during
reads.
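The byte values passed to the call below are plain MiB-to-byte conversions; a quick sketch of the arithmetic (the 512 MiB / 384 MiB figures mirror the values in the call, and 384 is simply 75% of 512):

```python
# Helper converting MiB to the raw byte values Iceberg procedure options expect.
def mib(n: int) -> int:
    return n * 1024 * 1024

# 512 MiB target file size; files smaller than 384 MiB become compaction input.
target_file_size_bytes = mib(512)  # 536870912
min_file_size_bytes = mib(384)     # 402653184

print(target_file_size_bytes, min_file_size_bytes)  # 536870912 402653184
```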
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_DATA_FILES"} */
CALL lakehouse_dev.system.rewrite_data_files(
  table => 'my_database.my_table',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '536870912',
    'min-file-size-bytes', '402653184',
    'min-input-files', '5',
    'max-concurrent-file-group-rewrites', '50'
  )
)

2. rewrite_position_delete_files
Trigger: The table has position-delete files above the configured threshold.
Merges position-delete files back into their corresponding data files. Position deletes add read-time overhead because the engine must apply the recorded deletes on every scan of the affected data files. Rewriting eliminates this cost.
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_POSITION_DELETE_FILES"} */
CALL lakehouse_dev.system.rewrite_position_delete_files(
  table => 'my_database.my_table'
)

3. rewrite_manifests
Trigger: The table has manifest files above the configured threshold.
Consolidates manifest files to reduce query planning time. Large tables can accumulate hundreds of manifests, each of which must be read during planning. Rewriting reduces the number of manifests the engine needs to open.
/* {"app":"snowpack","table":"my_database.my_table","action":"REWRITE_MANIFESTS"} */
CALL lakehouse_dev.system.rewrite_manifests(
  table => 'my_database.my_table'
)

4. expire_snapshots
Trigger: The table has snapshots older than snowpack.max_snapshot_age_days,
or this action is run after compaction to clean up the snapshots created by the
rewrite steps above.
Removes snapshots that are older than the retention threshold. Each expired snapshot’s metadata is dropped, and the files it exclusively references become eligible for orphan file removal in the next step.
/* {"app":"snowpack","table":"my_database.my_table","action":"EXPIRE_SNAPSHOTS"} */
CALL lakehouse_dev.system.expire_snapshots(
  table => 'my_database.my_table',
  older_than => TIMESTAMP '2026-04-20 00:00:00',
  retain_last => 1
)

5. remove_orphan_files
Trigger: Always runs after snapshot expiration. This is a cleanup step, not independently triggered.
Deletes data files on S3 that are no longer referenced by any active snapshot. These orphan files are a byproduct of compaction and snapshot expiration — the old small files have been replaced by compacted files, and once the snapshots referencing them are expired, the physical files can be safely removed.
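The older_than cutoff for orphan removal is typically computed as the current time minus a safety window, so files written by in-flight jobs are never mistaken for orphans. A minimal sketch; the 3-day window here is an assumption for illustration, not a documented Snowpack default:

```python
from datetime import datetime, timedelta, timezone

def orphan_cutoff(now: datetime, safety_window: timedelta) -> str:
    """Format the older_than cutoff as the TIMESTAMP literal the procedure expects."""
    return (now - safety_window).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical run time; a 3-day window yields the cutoff passed to the call.
now = datetime(2026, 4, 25, 0, 0, 0, tzinfo=timezone.utc)
print(orphan_cutoff(now, timedelta(days=3)))  # 2026-04-22 00:00:00
```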
/* {"app":"snowpack","table":"my_database.my_table","action":"REMOVE_ORPHAN_FILES"} */
CALL lakehouse_dev.system.remove_orphan_files(
  table => 'my_database.my_table',
  older_than => TIMESTAMP '2026-04-22 00:00:00'
)

Why the order matters
The five actions form a pipeline where each step’s output feeds the next:
- Rewrite first, clean up second. Compaction (rewrite_data_files, rewrite_position_delete_files, rewrite_manifests) creates new snapshots and leaves the old data files in place. The old files are still referenced by existing snapshots, so they cannot be deleted yet.
- Expire before removing orphans. expire_snapshots drops the old snapshots that still reference the pre-compaction files. Only after expiration do those files become orphans.
- Orphan removal last. remove_orphan_files scans for files on S3 that no active snapshot references. If it ran before snapshot expiration, it would miss files still held by soon-to-expire snapshots, leaving storage waste behind. Running it last ensures a clean sweep.
Reversing or interleaving these steps would either skip reclaimable storage or, worse, delete files that are still needed by active snapshots.
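The dependency chain above can be sketched as a simple driver loop. The action names are the real ones; the trigger set, runner callback, and overall function are hypothetical stand-ins for Snowpack's internals:

```python
# Fixed maintenance pipeline. Order is load-bearing: compaction creates new
# snapshots, expiration releases the old ones, orphan removal reclaims storage.
REWRITE_ACTIONS = [
    "rewrite_data_files",
    "rewrite_position_delete_files",
    "rewrite_manifests",
]

def run_maintenance(table: str, triggered: set[str], run_action) -> list[str]:
    """Run triggered actions in pipeline order; cleanup steps follow compaction."""
    executed = []
    for action in REWRITE_ACTIONS:
        if action in triggered:
            run_action(table, action)
            executed.append(action)
    # expire_snapshots: age-triggered, or run to clean up after any compaction.
    if "expire_snapshots" in triggered or executed:
        run_action(table, "expire_snapshots")
        executed.append("expire_snapshots")
    # remove_orphan_files always follows snapshot expiration, never runs alone.
    if "expire_snapshots" in executed:
        run_action(table, "remove_orphan_files")
        executed.append("remove_orphan_files")
    return executed

# Example: only small-file compaction is triggered; cleanup still follows it.
ran = run_maintenance("my_database.my_table", {"rewrite_data_files"}, lambda t, a: None)
print(ran)  # ['rewrite_data_files', 'expire_snapshots', 'remove_orphan_files']
```

Note that reordering the checks would break the invariant the document describes: orphan removal only sees reclaimable files once expiration has released them.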
Spark cost tagging
Every SQL call is prefixed with a JSON comment consumed by the data platform’s Spark cost tracker:
/* {"app":"snowpack","table":"<db>.<tbl>","action":"<ACTION>"} */

This comment is parsed by the cost_engine.parse_statement_metadata function in
the data-platform-core-infra repository’s Glue ETL pipeline. It enables
per-table cost attribution for all Snowpack-initiated Spark work. If the comment
shape changes, the parser in that repository must be updated in lockstep.