Why Compaction Matters
Apache Iceberg is a file-based table format. As data lands in the lake, whether through batch jobs, streaming ingestion, or change data capture (CDC), it often arrives in small files. Over time, these small files accumulate, leading to:
- Excessive metadata growth (manifest files, manifest lists)
- Expensive planning and scanning (too many file reads per query)
- Unbalanced partition sizes (e.g., one partition with 10 files of 5MB, another with 1 file of 1GB)
- Increased S3 API costs
Configuration Modes
Ryft supports two modes for compaction: Automatic and Custom.

Automatic Compaction (Self-tuning)
In automatic mode, Ryft continuously monitors table and partition health and dynamically decides when and how to run compaction. This includes:
- Detecting file size skew or small file accumulation
- Learning from actual query patterns and write frequencies
- Avoiding unnecessary rewrites for old or rarely queried partitions
- Scaling resources dynamically based on table size and performance needs
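The small-file and skew detection above can be pictured with simple heuristics. The sketch below is illustrative only; the thresholds and the `needs_compaction` function are assumptions for this example, not Ryft's actual (adaptive) logic:

```python
from statistics import mean, pstdev

# Illustrative thresholds; Ryft's real tuning is adaptive and not shown here.
SMALL_FILE_BYTES = 32 * 1024 * 1024   # files under 32 MB count as "small"
SKEW_CV_THRESHOLD = 1.0               # coefficient of variation signalling size skew

def needs_compaction(file_sizes: list[int]) -> bool:
    """Flag a partition whose files are mostly small or highly skewed in size."""
    if len(file_sizes) < 2:
        return False
    small_ratio = sum(s < SMALL_FILE_BYTES for s in file_sizes) / len(file_sizes)
    cv = pstdev(file_sizes) / mean(file_sizes)  # relative spread of file sizes
    return small_ratio > 0.5 or cv > SKEW_CV_THRESHOLD

# A partition of many 5 MB files is flagged; two balanced large files are not.
print(needs_compaction([5_000_000] * 10))            # True
print(needs_compaction([512_000_000, 530_000_000]))  # False
```

A production system would combine signals like these with query patterns and write frequency, as described above, rather than fixed cutoffs.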
Custom Configuration
Custom mode gives users full control over how compaction is executed. You can configure:
- Compaction strategy: bin-packing, sorting, or Z-ordering
- Target file size: e.g., 256 MB, 512 MB, or 1 GB
- Sort columns: when applying a sorting or Z-ordering strategy, you must define the columns to sort by
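As a rough illustration of how a bin-packing strategy relates to the target file size, small files can be greedily grouped into rewrite tasks of roughly the target size. This is a first-fit-decreasing sketch under assumed inputs, not Ryft's planner:

```python
def bin_pack(file_sizes: list[int], target_bytes: int) -> list[list[int]]:
    """Greedy first-fit-decreasing: group files into bins of at most target_bytes."""
    bins: list[list[int]] = []
    totals: list[int] = []
    for size in sorted(file_sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_bytes:
                bins[i].append(size)
                totals[i] += size
                break
        else:  # no existing bin has room; open a new one
            bins.append([size])
            totals.append(size)
    return bins

# Pack three hundred 5 MB files toward a 256 MB target.
groups = bin_pack([5 * 1024 * 1024] * 300, target_bytes=256 * 1024 * 1024)
print(len(groups))  # 6 rewrite groups instead of 300 small files
```

Each group would then be rewritten as a single output file close to the configured target size.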
Key Features
Adaptive File Size
For highly compressed data, a common pitfall in Iceberg is that compaction output files come out smaller than the configured target size. Ryft automatically monitors output file sizes and adapts the target file size to the actual compression ratio of the data, so that compaction jobs produce files as close to the target size as possible. This can result in dramatically fewer files and a reduced storage footprint.

Partition Selection
Ryft identifies the partitions that would benefit most from compaction and prioritizes them, improving both compaction efficiency and query performance. Ryft tracks:
- Number of files per partition
- Average file size
- Query activity per partition
- Time since last compaction
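One way to picture this prioritization is a weighted score over the tracked signals. The field names, weights, and formula below are hypothetical, chosen only to illustrate how such signals might combine:

```python
from dataclasses import dataclass

@dataclass
class PartitionStats:
    # Hypothetical signal set mirroring the metrics listed above.
    file_count: int
    avg_file_bytes: int
    queries_last_day: int
    hours_since_compaction: float

def priority(p: PartitionStats, target_bytes: int = 512 * 1024 * 1024) -> float:
    """Higher score = compact sooner. All weights are illustrative."""
    # Many files far below the target size create pressure to compact.
    small_file_pressure = p.file_count * (1 - min(p.avg_file_bytes / target_bytes, 1.0))
    # Cap staleness so very old, idle partitions don't dominate.
    staleness = min(p.hours_since_compaction / 24, 7.0)
    # Hot partitions (frequently queried) get boosted.
    return small_file_pressure * (1 + p.queries_last_day / 100) + staleness

hot = PartitionStats(file_count=200, avg_file_bytes=8_000_000,
                     queries_last_day=500, hours_since_compaction=48)
cold = PartitionStats(file_count=3, avg_file_bytes=500_000_000,
                      queries_last_day=0, hours_since_compaction=1)
print(priority(hot) > priority(cold))  # True: many small, hot files win
```

A ranking like this is what lets the system spend compaction effort where queries actually benefit.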
Delete File Handling
Iceberg supports delete files (position and equality deletes) to enable row-level deletes without rewriting data. However, too many delete files can degrade query performance. Ryft compaction optimizes delete files by:
- Rewriting deleted rows into new data files to eliminate excessive delete files
- Grouping and consolidating equality deletes into fewer, larger delete files
- Adapting to delete-intensive workloads, ensuring that delete files do not accumulate excessively
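The consolidation step can be sketched as merging each partition's many small equality-delete files into one larger file. The descriptors and threshold below are hypothetical stand-ins for illustration; Ryft's internal representation is not shown:

```python
from collections import defaultdict

# Hypothetical descriptors: (partition, delete_file_id, deleted_row_count).
delete_files = [("2024-01-01", f"eq-del-{i}", 100) for i in range(50)]
delete_files += [("2024-01-02", "eq-del-big", 10_000)]

def consolidate(files, min_rows_per_file=1_000):
    """Merge each partition's small equality-delete files into one consolidated file."""
    by_partition = defaultdict(list)
    for partition, file_id, rows in files:
        by_partition[partition].append((file_id, rows))
    out = {}
    for partition, entries in by_partition.items():
        small = [(f, r) for f, r in entries if r < min_rows_per_file]
        keep = [(f, r) for f, r in entries if r >= min_rows_per_file]
        if len(small) > 1:  # rewrite many small delete files as one larger file
            keep.append((f"consolidated-{partition}", sum(r for _, r in small)))
        else:
            keep.extend(small)
        out[partition] = keep
    return out

result = consolidate(delete_files)
print(len(result["2024-01-01"]))  # 1: fifty small delete files became one
```

Fewer, larger delete files mean fewer merge operations at read time, which is the query-performance win described above.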
Resource Optimization
Compaction jobs run on optimized compute and automatically allocate CPU and memory based on table size and workload characteristics. Data that is already optimized is not rewritten, and unused data is not prioritized for compaction, reducing resource usage. Compaction is optimized for performance and cost, and runs in parallel, adapting to the number of tables available.

Workload-Aware Compaction
Ryft optimizes compaction differently depending on the ingestion and access pattern, and differentiates between the following workload types:
- Batch
- Streaming
- Change Data Capture (CDC)
- Delete-intensive workloads
- Hybrid
- Full table rewrites
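These workload types typically call for different compaction parameters. A hypothetical tuning table might look like the following; every value and key name here is illustrative, not a Ryft default:

```python
# Hypothetical per-workload tuning table; all values are illustrative.
WORKLOAD_PROFILES = {
    "batch":            {"target_file_mb": 512,  "run_every_hours": 24,   "strategy": "bin-packing"},
    "streaming":        {"target_file_mb": 256,  "run_every_hours": 1,    "strategy": "bin-packing"},
    "cdc":              {"target_file_mb": 256,  "run_every_hours": 2,    "strategy": "sorting"},
    "delete-intensive": {"target_file_mb": 256,  "run_every_hours": 2,    "strategy": "bin-packing"},
    "hybrid":           {"target_file_mb": 512,  "run_every_hours": 6,    "strategy": "bin-packing"},
    "full-rewrite":     {"target_file_mb": 1024, "run_every_hours": None, "strategy": "z-order"},
}

def profile_for(workload: str) -> dict:
    """Look up compaction parameters for a workload type (falls back to batch)."""
    return WORKLOAD_PROFILES.get(workload, WORKLOAD_PROFILES["batch"])

print(profile_for("streaming")["run_every_hours"])  # 1
```

The intuition: streaming and CDC tables accumulate small files continuously and need frequent, lightweight passes, while batch tables can be compacted on a slower cadence with larger targets.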