Project Structure

Snowpack is organized as a layered Python application. This page maps every top-level module in the snowpack/ package and explains how the layers fit together.

Module map

  • api.py — FastAPI app: all HTTP endpoints + lifespan.
  • cli.py — Standalone snowpack CLI commands; bypasses the API queue.
  • worker.py — KEDA-invoked maintenance worker (one job per pod).
  • jobs.py — JobStore: Postgres-backed queue + state (DL-197 fence).
  • locks.py — TableLock: per-table ownership-checked lock.
  • table_cache.py — TableCache + TableCacheSyncWorker (atomic-swap refresh).
  • history.py — HistoryStore: schema management + persistent reads.
  • backend.py — select_job_store / select_table_cache factory.
  • metrics.py — OTel/Prometheus gauges (queue depth, workers, etc.).
  • config.py — Pydantic CompactionConfig: env-driven configuration.
  • discovery.py — PolarisDiscovery: REST catalog table listing.
  • catalog.py — PyIceberg catalog factories (Polaris + Glue).
  • analyzer.py — TableAnalyzer: produces HealthReport.
  • maintenance.py — MaintenanceRunner: executes one action via Spark.
  • spark.py — SparkQueryEngine: Thrift/Kyuubi wrapper.
  • service.py — CompactionService: request-scoped service composition.
  • orchestrator.py — auto-submits maintenance based on health.
  • health_sync.py — shared PyIceberg health precomputation worker.
  • health_sync_job.py — CronJob entrypoint for one health-sync cycle.
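
Since config.py centers on an env-driven Pydantic model, the pattern can be sketched with a stdlib dataclass stand-in. The field names below are illustrative assumptions, not Snowpack's actual settings:

```python
# Simplified stdlib stand-in for config.py's env-driven CompactionConfig.
# Field names here are assumptions for illustration only.
import os
from dataclasses import dataclass, fields

@dataclass
class CompactionConfig:
    database_url: str = "postgresql://localhost/snowpack"
    max_parallel_jobs: int = 4

    @classmethod
    def from_env(cls, prefix: str = "SNOWPACK_") -> "CompactionConfig":
        """Build config from PREFIX_FIELDNAME environment variables."""
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get(prefix + f.name.upper())
            if raw is not None:
                # Coerce the string env value to the field's annotated type.
                kwargs[f.name] = f.type(raw)
        return cls(**kwargs)
```

The real CompactionConfig gets coercion and validation from Pydantic rather than the manual loop shown here.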

Layered architecture

Snowpack follows a strict layered architecture. Each layer only calls into the layer directly below it, keeping concerns separated and individual components testable in isolation.

Entry points

There are three ways to interact with Snowpack:

  • CLI (cli.py) — Click commands for ad-hoc operations. The CLI talks directly to the service layer; it does not go through the API.
  • REST API (api.py) — FastAPI endpoints for programmatic access and the orchestrator. All job submission and health queries flow through here.
  • Web UI — An Alpine.js single-page application served by the API. The UI is a pure API consumer and introduces no additional backend logic.
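
The convergence described above can be sketched in miniature: the CLI and the API handler end up invoking the same service object, with the CLI skipping the HTTP layer and job queue entirely. All names and signatures below are illustrative, not Snowpack's actual code:

```python
# Sketch: both entry points call into the same service layer.
# Class and method names are assumptions for illustration.

class CompactionService:
    """Stand-in for service.py's CompactionService."""
    def analyze(self, table: str) -> dict:
        return {"table": table, "health": "ok"}

def cli_analyze(table: str) -> dict:
    # cli.py path: construct the service and call it directly;
    # no HTTP request, no job queue.
    return CompactionService().analyze(table)

def api_analyze(table: str) -> dict:
    # api.py path: a FastAPI handler performs the same service call,
    # but the request arrives over HTTP and job submission is queued.
    return CompactionService().analyze(table)
```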

Service layer

All three entry points converge on CompactionService (service.py), which composes the lower-level components for a single request:

  • Discovery (discovery.py) — lists tables from the Polaris REST catalog.
  • Analyzer (analyzer.py) — loads Iceberg metadata and produces a HealthReport for a table.
  • MaintenanceRunner (maintenance.py) — executes a single maintenance action (compaction, snapshot expiry, etc.) by issuing Spark SQL.
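
A minimal sketch of this request-scoped composition, with every class and method name assumed for illustration rather than taken from Snowpack's source:

```python
# Sketch of service.py-style composition: discover -> analyze -> act.
from dataclasses import dataclass

@dataclass
class HealthReport:
    table: str
    needs_compaction: bool

class Discovery:
    """Stand-in for discovery.py; would query the Polaris REST catalog."""
    def list_tables(self) -> list[str]:
        return ["db.events", "db.users"]

class TableAnalyzer:
    """Stand-in for analyzer.py; would load Iceberg metadata."""
    def analyze(self, table: str) -> HealthReport:
        return HealthReport(table=table, needs_compaction=True)

class MaintenanceRunner:
    """Stand-in for maintenance.py; would issue Spark SQL."""
    def run(self, report: HealthReport) -> str:
        return f"compacted {report.table}" if report.needs_compaction else "noop"

class CompactionService:
    def __init__(self, discovery, analyzer, runner):
        self.discovery = discovery
        self.analyzer = analyzer
        self.runner = runner

    def maintain_all(self) -> list[str]:
        # One request: discover tables, analyze each, act on the report.
        return [
            self.runner.run(self.analyzer.analyze(table))
            for table in self.discovery.list_tables()
        ]
```

Because the service only depends on the three collaborators passed to its constructor, each layer can be replaced with a test double in isolation.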

Query engine

At the bottom of the stack sits the QueryEngine protocol (spark.py). This is a minimal, backend-agnostic interface that the service layer uses to run SQL. The concrete implementation, SparkQueryEngine, wraps a PyHive/Thrift connection to Spark Thrift Server (or Kyuubi).
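
A hedged sketch of what such a protocol can look like, using typing.Protocol for structural typing. The execute signature is an assumption, not spark.py's actual interface:

```python
# Sketch of a minimal, backend-agnostic query-engine protocol.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class QueryEngine(Protocol):
    def execute(self, sql: str) -> list[tuple[Any, ...]]:
        """Run a SQL statement and return result rows."""
        ...

class FakeQueryEngine:
    """In-memory stand-in, useful for unit-testing the service layer."""
    def __init__(self) -> None:
        self.executed: list[str] = []

    def execute(self, sql: str) -> list[tuple[Any, ...]]:
        self.executed.append(sql)
        return []
```

Structural typing means any object with a matching execute() satisfies the protocol, so tests can swap in a fake without subclassing the real SparkQueryEngine.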

Data flow summary

CLI (Click) / REST API (FastAPI) / Web UI (Alpine.js)
                        |
                CompactionService
                        |
        +---------------+---------------+
        |               |               |
    Discovery       Analyzer    MaintenanceRunner
    (Polaris)      (PyIceberg)     (Spark SQL)
        |               |               |
        +---------------+---------------+
                        |
              QueryEngine Protocol
                        |
                SparkQueryEngine
                (PyHive / Thrift)

State management — job queue, table locks, health cache, and run history — is handled entirely through Postgres via the jobs.py, locks.py, table_cache.py, and history.py modules.
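
The backend.py factory pattern that selects these stores can be sketched as follows. The in-memory fallback and all signatures are assumptions for illustration, not Snowpack's actual behavior:

```python
# Sketch of a backend selection factory in the spirit of backend.py's
# select_job_store. Names and the fallback policy are assumptions.
from typing import Optional

class PostgresJobStore:
    """Stand-in for jobs.py's Postgres-backed JobStore."""
    def __init__(self, dsn: str):
        self.dsn = dsn

class InMemoryJobStore:
    """Hypothetical local-testing double; Snowpack may not ship one."""
    def __init__(self) -> None:
        self.jobs: list[dict] = []

def select_job_store(database_url: Optional[str]):
    # Use the Postgres-backed store when a DSN is configured;
    # otherwise fall back to an in-memory store for local runs.
    if database_url:
        return PostgresJobStore(database_url)
    return InMemoryJobStore()
```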