Ryft’s Data Compaction optimizer is responsible for improving performance, reducing costs, and keeping your Apache Iceberg tables healthy by rewriting small, inefficient data files into larger, optimally sized ones.
This helps minimize metadata overhead, improve scan efficiency, and reduce the number of S3 list operations required for query planning and execution.
Apache Iceberg is a file-based table format. As data lands in the lake - whether through batch jobs, streaming ingestion, or change data capture (CDC) - it often arrives in small files. Over time, these small files accumulate, leading to increased metadata overhead, slower scans, and more S3 list operations during query planning and execution.
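For context, this is roughly what a one-off manual compaction looks like using Iceberg’s built-in rewrite_data_files Spark procedure, which Ryft runs and tunes automatically. The sketch assumes a Spark session configured with the Iceberg runtime and SQL extensions; the catalog and table names are placeholders:

```python
# A one-off manual compaction using Iceberg's built-in Spark procedure.
# "my_catalog" and "db.events" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

# Bin-pack small data files into larger ones, up to the target file size.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table    => 'db.events',
        strategy => 'binpack',
        options  => map('target-file-size-bytes', '536870912')
    )
""")
```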
In automatic mode, Ryft continuously monitors table and partition health and dynamically decides when and how to run compaction. This includes:
- Detecting file size skew or small file accumulation
- Learning from actual query patterns and write frequencies
- Avoiding unnecessary rewrites for old or rarely queried partitions
- Scaling resources dynamically based on table size and performance needs
There’s no need to configure anything - Ryft selects the optimal file size, compaction strategy, and scheduling frequency based on real-world usage.
This is the recommended mode of operation for most tables, unless you have specific requirements.
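As a mental model, here is a simplified sketch of the kind of small-file heuristic described above. The thresholds, field names, and weighting are illustrative assumptions, not Ryft’s actual logic:

```python
from dataclasses import dataclass

# Illustrative thresholds only -- Ryft derives these from observed usage.
SMALL_FILE_BYTES = 64 * 1024 * 1024   # files below 64 MiB count as "small"
SMALL_FILE_RATIO = 0.5                # compact when half the files are small
MIN_FILES = 8                         # skip partitions with very few files

@dataclass
class PartitionStats:
    file_sizes: list[int]             # bytes per data file in the partition
    queries_last_7d: int              # observed query activity
    hours_since_compaction: float

def should_compact(p: PartitionStats) -> bool:
    """Hypothetical small-file heuristic in the spirit of automatic mode."""
    if len(p.file_sizes) < MIN_FILES:
        return False
    small = sum(1 for size in p.file_sizes if size < SMALL_FILE_BYTES)
    skewed = small / len(p.file_sizes) >= SMALL_FILE_RATIO
    # Prioritize partitions that are actually queried; leave cold ones alone
    # unless they have gone a long time without any compaction.
    hot = p.queries_last_7d > 0
    return skewed and (hot or p.hours_since_compaction > 24 * 30)
```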
For highly compressed data, a common pitfall in Iceberg is that compaction output files come out smaller than the configured target size, because planning-time size estimates do not fully account for how well the data compresses. Ryft automatically monitors output file sizes
and adapts the target file size to the actual compression ratio of the data. This ensures that compaction jobs produce files that are as close to the target size as possible.
This can result in dramatically fewer files and reduced storage size.
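A minimal sketch of the idea, assuming the adaptation is a simple proportional scaling of the configured target (the function name and numbers are illustrative):

```python
def adjusted_target_size(desired_bytes: int,
                         configured_bytes: int,
                         observed_output_bytes: float) -> int:
    """Scale the configured target by how far outputs undershoot it.

    Hypothetical illustration of compression-aware target sizing:
    if outputs come out at half the configured target, configure a
    target twice as large so outputs land near the desired size.
    """
    shortfall = observed_output_bytes / configured_bytes  # e.g. 0.5
    return int(desired_bytes / shortfall)

# Example: we want ~512 MiB files, but with the target configured at
# 512 MiB the outputs average 256 MiB, so the target is raised to 1 GiB.
print(adjusted_target_size(512 * 2**20, 512 * 2**20, 256 * 2**20))
```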
Ryft identifies the partitions that would benefit most from compaction and prioritizes them, improving both compaction efficiency and query performance.
Ryft tracks:
- Number of files per partition
- Average file size
- Query activity per partition
- Time since last compaction
This ensures that hot partitions are kept in optimal shape, while cold partitions are left untouched unless necessary (e.g. for retention or tiering).
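A sketch of how these signals might combine into a single priority score; the formula and weights are illustrative assumptions, not Ryft’s actual scoring:

```python
def partition_priority(num_files: int,
                       avg_file_bytes: float,
                       queries_last_7d: int,
                       days_since_compaction: float,
                       target_bytes: int = 512 * 2**20) -> float:
    """Hypothetical priority score: higher means compact sooner."""
    # Fragmentation: many files that sit far below the target size.
    fragmentation = num_files * max(0.0, 1.0 - avg_file_bytes / target_bytes)
    # Heat: partitions that are actually queried float to the top.
    heat = 1.0 + queries_last_7d
    # Staleness: gently boost partitions not compacted in a while.
    staleness = 1.0 + days_since_compaction / 7.0
    return fragmentation * heat * staleness

# A cold, well-compacted partition scores near zero and is skipped,
# while a hot partition full of ~8 MiB files scores very high.
```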
Iceberg supports delete files (position and equality deletes) to enable row-level deletes without rewriting data files. However, too many delete files can degrade query performance. Ryft compaction optimizes delete files by:
- Rewriting data files to physically remove deleted rows, eliminating excessive delete files
- Grouping and consolidating equality deletes into fewer, larger delete files
- Adapting to delete-intensive workloads, ensuring that delete files do not accumulate excessively
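For comparison, plain Iceberg exposes manual equivalents of this maintenance through its Spark procedures. The sketch assumes a Spark session configured with the Iceberg runtime and SQL extensions; the catalog and table names are placeholders:

```python
# Manual delete-file maintenance with Iceberg's Spark procedures.
# "my_catalog" and "db.events" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delete-maintenance").getOrCreate()

# Rewrite data files that have accumulated many associated delete files,
# folding the deletes into fresh data files.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table   => 'db.events',
        options => map('delete-file-threshold', '10')
    )
""")

# Consolidate position delete files into fewer, larger ones.
spark.sql("""
    CALL my_catalog.system.rewrite_position_delete_files(
        table => 'db.events'
    )
""")
```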
Compaction jobs run on optimized compute and automatically allocate CPU and memory resources based on the table size and workload characteristics.
Data that is already optimized is not rewritten, and unused data is not prioritized for compaction, which reduces resource usage.
Compaction is optimized for performance and cost, and runs in parallel, adapting to the number of tables being processed.