Ryft’s Data Compaction optimizer is responsible for improving performance, reducing costs, and keeping your Apache Iceberg tables healthy by rewriting small, inefficient data files into larger, optimally sized ones. This minimizes metadata overhead, improves scan efficiency, and reduces the number of S3 list operations required for query planning and execution.

Why Compaction Matters

Apache Iceberg is a file-based table format. As data lands in the lake - whether through batch jobs, streaming ingestion, or change data capture (CDC) - it often arrives in small files. Over time, these small files accumulate, leading to:
  • Excessive metadata growth (manifest files, manifest lists)
  • Expensive planning and scanning (too many file reads per query)
  • Unbalanced partition sizes (e.g., one partition with 10 files of 5MB, another with 1 file of 1GB)
  • Increased S3 API costs
Compaction solves this by continuously merging small data and delete files into larger, more efficient files.
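To see why this matters at the file level, here is a back-of-the-envelope illustration with assumed numbers (50 GB in one partition, 5 MB ingest files, a 512 MB compaction target):

    # Back-of-the-envelope illustration of the small-file problem
    # (assumed numbers, not measurements).
    partition_bytes = 50 * 1024**3            # 50 GB of data in one partition
    small_file      = 5 * 1024**2             # 5 MB files from streaming ingest
    target_file     = 512 * 1024**2           # 512 MB compacted files

    files_before = partition_bytes // small_file    # 10,240 files
    files_after  = partition_bytes // target_file   # 100 files

    # Every file adds metadata entries and at least one S3 read per scan,
    # so a full-partition scan drops from ~10,240 object reads to ~100.
    print(files_before, files_after)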

Configuration Modes

Ryft supports two modes for compaction: Automatic and Custom.

Automatic Compaction (Self-tuning)

In automatic mode, Ryft continuously monitors table and partition health and dynamically decides when and how to run compaction. This includes:
  • Detecting file size skew or small file accumulation
  • Learning from actual query patterns and write frequencies
  • Avoiding unnecessary rewrites for old or rarely queried partitions
  • Scaling resources dynamically based on table size and performance needs
There’s no need to configure anything - Ryft selects the optimal file size, compaction strategy, and scheduling frequency based on real-world usage. This is the recommended mode of operation for most tables, unless you have specific requirements.

Custom Configuration

Custom mode gives users full control over how compaction is executed. You can configure:
  • Compaction strategy: bin-packing, sorting, or Z-ordering
  • Target file size: e.g. 256MB, 512MB, 1GB
  • Columns to sort by: required when applying the sorting or Z-ordering strategy (see the example after this list)
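As a rough reference, these settings map onto options of Apache Iceberg’s standard rewrite procedure. The sketch below is illustrative only - Ryft applies your custom configuration for you, and the catalog, table, and column names are placeholders:

    from pyspark.sql import SparkSession

    # Assumes a Spark session with an Iceberg catalog configured.
    spark = SparkSession.builder.getOrCreate()

    # Sort-based rewrite, Z-ordered on two placeholder columns, with a
    # 512 MB target file size (536870912 bytes).
    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'analytics.events',
            strategy => 'sort',
            sort_order => 'zorder(event_date, user_id)',
            options => map('target-file-size-bytes', '536870912')
        )
    """)

Swapping strategy => 'binpack' (and dropping sort_order) gives the bin-packing equivalent.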

Key Features

Adaptive File Size

For highly compressed data, a common pitfall in Iceberg is that compaction output files come out smaller than the defined target size. Ryft automatically monitors output file sizes and adapts the target file size based on the actual compression ratio of the data, ensuring that compaction jobs produce files as close to the target size as possible. This can result in dramatically fewer files and a smaller storage footprint.
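One simplified way to picture this adaptation (an illustrative heuristic with assumed names and numbers, not Ryft’s actual algorithm): if compacted output keeps landing below the target because of compression, scale the planned rewrite target up by the observed shortfall.

    # Illustrative heuristic only, not Ryft's actual algorithm: scale the
    # planned target by the observed shortfall so the next run's output
    # lands closer to the desired on-disk file size.
    def adapt_target_bytes(desired_bytes: int, observed_avg_bytes: int) -> int:
        if observed_avg_bytes >= desired_bytes:
            return desired_bytes                # already at or above target
        shortfall = desired_bytes / observed_avg_bytes
        return int(desired_bytes * shortfall)   # compensate for compression

    # Example: aiming for 512 MB files, but highly compressed output averages
    # 128 MB, so the next rewrite plans with a ~2 GB target.
    print(adapt_target_bytes(512 * 1024**2, 128 * 1024**2) // 1024**2, "MB")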

Partition Selection

Ryft identifies the partitions that would benefit most from compaction and prioritizes them to improve compaction efficiency and query performance. Ryft tracks:
  • Number of files per partition
  • Average file size
  • Query activity per partition
  • Time since last compaction
This ensures that hot partitions are kept in optimal shape, while cold partitions are left untouched unless necessary (e.g. for retention or tiering).
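A simplified sketch of how such prioritization could be scored is below; the signals match the list above, but the weighting is an assumption for illustration, not Ryft’s actual model:

    from dataclasses import dataclass

    @dataclass
    class PartitionStats:
        file_count: int
        avg_file_size_mb: float
        queries_per_day: int
        hours_since_compaction: float

    # Illustrative scoring only: many small files, heavy query traffic, and a
    # long time since the last rewrite all push a partition up the queue.
    def compaction_priority(p: PartitionStats, target_mb: float = 512.0) -> float:
        small_file_pressure = p.file_count * max(0.0, 1.0 - p.avg_file_size_mb / target_mb)
        query_weight = 1.0 + p.queries_per_day
        staleness = min(p.hours_since_compaction / 24.0, 7.0)
        return small_file_pressure * query_weight * staleness

    # A hot partition with 200 files of ~8 MB, queried 50 times a day,
    # last compacted two days ago:
    print(compaction_priority(PartitionStats(200, 8.0, 50, 48.0)))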

Delete File Handling

Iceberg supports delete files (position and equality deletes) to enable row-level deletes without rewriting data. However, too many delete files can degrade query performance. Ryft compaction optimizes delete files by:
  • Applying accumulated deletes and rewriting the affected data files, eliminating excessive delete files
  • Grouping and consolidating equality deletes into fewer, larger delete files
  • Adapting to delete-intensive workloads, ensuring that delete files do not accumulate excessively
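For reference, the corresponding manual operations in open-source Iceberg are shown below. Ryft triggers equivalent rewrites automatically; the catalog and table names are placeholders:

    from pyspark.sql import SparkSession

    # Assumes a Spark session with an Iceberg catalog configured.
    spark = SparkSession.builder.getOrCreate()

    # Fold accumulated deletes into the data: rewrite any data file that is
    # referenced by two or more delete files (placeholder names).
    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'analytics.events',
            options => map('delete-file-threshold', '2')
        )
    """)

    # Consolidate small position-delete files into fewer, larger ones.
    spark.sql("""
        CALL my_catalog.system.rewrite_position_delete_files(
            table => 'analytics.events'
        )
    """)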

Resource Optimization

Compaction jobs run on optimized compute and automatically allocate CPU and memory based on table size and workload characteristics. Data that is already optimized is not rewritten, and unused data is deprioritized for compaction to reduce resource usage. Compaction is optimized for performance and cost and runs in parallel, adapting to the number of tables involved.

Workload-Aware Compaction

Ryft optimizes compaction differently depending on the ingestion and access pattern, differentiating between the following workload types (a hypothetical tuning sketch follows the list):
  • Batch
  • Streaming
  • Change Data Capture (CDC)
  • Delete-intensive workloads
  • Hybrid
  • Full table rewrites
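As a purely hypothetical illustration of what workload-aware tuning can mean in practice - the profiles and values below are assumptions, not Ryft’s actual policy:

    # Hypothetical profiles, for illustration only: different ingestion
    # patterns tend to call for different rewrite strategies and cadences.
    WORKLOAD_PROFILES = {
        "batch":            {"strategy": "binpack", "cadence": "after each batch load"},
        "streaming":        {"strategy": "binpack", "cadence": "hourly",
                             "min_input_files": 5},
        "cdc":              {"strategy": "binpack", "cadence": "hourly",
                             "delete_file_threshold": 2},
        "delete_intensive": {"strategy": "binpack",
                             "rewrite_position_deletes": True},
        "hybrid":           {"strategy": "sort", "cadence": "daily"},
        "full_rewrite":     {"strategy": "sort", "rewrite_all": True},
    }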