Ryft’s Data Compaction optimizer is responsible for improving performance, reducing costs, and keeping your Apache Iceberg tables healthy by rewriting small, inefficient data files into larger, optimally sized ones. This minimizes metadata overhead, improves scan efficiency, and reduces the number of S3 list operations required for query planning and execution.

Why Compaction Matters

Apache Iceberg is a file-based table format. As data lands in the lake - whether through batch jobs, streaming ingestion, or change data capture (CDC) - it often arrives in small files. Over time, these small files accumulate, leading to:
  • Excessive metadata growth (manifest files, manifest lists)
  • Expensive planning and scanning (too many file reads per query)
  • Unbalanced partition sizes (e.g., one partition with 10 files of 5MB, another with 1 file of 1GB)
  • Increased S3 API costs
Compaction solves this by continuously merging small data and delete files into larger, more efficient files.
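To see why this matters at the file level, here is a back-of-the-envelope illustration with assumed numbers (50 GB in one partition, 5 MB ingest files, a 512 MB compaction target):

    # Back-of-the-envelope illustration of the small-file problem
    # (assumed numbers, not measurements).
    partition_bytes = 50 * 1024**3            # 50 GB of data in one partition
    small_file      = 5 * 1024**2             # 5 MB files from streaming ingest
    target_file     = 512 * 1024**2           # 512 MB compacted files

    files_before = partition_bytes // small_file    # 10,240 files
    files_after  = partition_bytes // target_file   # 100 files

    # Every file adds metadata entries and at least one S3 read per scan,
    # so a full-partition scan drops from ~10,240 object reads to ~100.
    print(files_before, files_after)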

Configuration Modes

Ryft supports two modes for compaction: Automatic and Custom.

Automatic Compaction (Self-tuning)

In automatic mode, Ryft continuously monitors table and partition health and dynamically decides when and how to run compaction. This includes:
  • Detecting file size skew or small file accumulation
  • Learning from actual query patterns and write frequencies
  • Avoiding unnecessary rewrites for old or rarely queried partitions
  • Scaling resources dynamically based on table size and performance needs
There’s no need to configure anything - Ryft selects the optimal file size, compaction strategy, and scheduling frequency based on real-world usage. This is the recommended mode of operation for most tables, unless you have specific requirements.

Custom Configuration

Custom mode gives users full control over how compaction is executed. You can configure:
  • Compaction strategy: bin-packing, sorting, or Z-ordering
  • Target file size: e.g. 256MB, 512MB, 1GB
  • Columns to sort by: required when applying the sorting or Z-ordering strategy (see the example after this list)
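As a rough reference, these settings map onto options of Apache Iceberg’s standard rewrite procedure. The sketch below is illustrative only - Ryft applies your custom configuration for you, and the catalog, table, and column names are placeholders:

    from pyspark.sql import SparkSession

    # Assumes a Spark session with an Iceberg catalog configured.
    spark = SparkSession.builder.getOrCreate()

    # Sort-based rewrite, Z-ordered on two placeholder columns, with a
    # 512 MB target file size (536870912 bytes).
    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'analytics.events',
            strategy => 'sort',
            sort_order => 'zorder(event_date, user_id)',
            options => map('target-file-size-bytes', '536870912')
        )
    """)

Swapping strategy => 'binpack' (and dropping sort_order) gives the bin-packing equivalent.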

Key Features

Adaptive File Size

For highly compressed data, a common pitfall in Iceberg is that compaction output files come out smaller than the defined target size. Ryft automatically monitors output file sizes and adapts the target file size based on the actual compression ratio of the data, ensuring that compaction jobs produce files as close to the target size as possible. This can result in dramatically fewer files and a smaller storage footprint.
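One simplified way to picture this adaptation (an illustrative heuristic with assumed names and numbers, not Ryft’s actual algorithm): if compacted output keeps landing below the target because of compression, scale the planned rewrite target up by the observed shortfall.

    # Illustrative heuristic only, not Ryft's actual algorithm: scale the
    # planned target by the observed shortfall so the next run's output
    # lands closer to the desired on-disk file size.
    def adapt_target_bytes(desired_bytes: int, observed_avg_bytes: int) -> int:
        if observed_avg_bytes >= desired_bytes:
            return desired_bytes                # already at or above target
        shortfall = desired_bytes / observed_avg_bytes
        return int(desired_bytes * shortfall)   # compensate for compression

    # Example: aiming for 512 MB files, but highly compressed output averages
    # 128 MB, so the next rewrite plans with a ~2 GB target.
    print(adapt_target_bytes(512 * 1024**2, 128 * 1024**2) // 1024**2, "MB")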

Partition Selection

Ryft identifies the partitions that would benefit most from compaction and prioritizes them to improve compaction efficiency and query performance. Ryft tracks:
  • Number of files per partition
  • Average file size
  • Query activity per partition
  • Time since last compaction
This ensures that hot partitions are kept in optimal shape, while cold partitions are left untouched unless necessary (e.g. for retention or tiering).
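A simplified sketch of how such prioritization could be scored is below; the signals match the list above, but the weighting is an assumption for illustration, not Ryft’s actual model:

    from dataclasses import dataclass

    @dataclass
    class PartitionStats:
        file_count: int
        avg_file_size_mb: float
        queries_per_day: int
        hours_since_compaction: float

    # Illustrative scoring only: many small files, heavy query traffic, and a
    # long time since the last rewrite all push a partition up the queue.
    def compaction_priority(p: PartitionStats, target_mb: float = 512.0) -> float:
        small_file_pressure = p.file_count * max(0.0, 1.0 - p.avg_file_size_mb / target_mb)
        query_weight = 1.0 + p.queries_per_day
        staleness = min(p.hours_since_compaction / 24.0, 7.0)
        return small_file_pressure * query_weight * staleness

    # A hot partition with 200 files of ~8 MB, queried 50 times a day,
    # last compacted two days ago:
    print(compaction_priority(PartitionStats(200, 8.0, 50, 48.0)))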

Delete File Handling

Iceberg supports delete files (position and equality deletes) to enable row-level deletes without rewriting data. However, too many delete files can degrade query performance. Ryft compaction optimizes delete files by:
  • Applying accumulated deletes and rewriting the affected data files, eliminating excessive delete files
  • Grouping and consolidating equality deletes into fewer, larger delete files
  • Adapting to delete-intensive workloads, ensuring that delete files do not accumulate excessively
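For reference, the corresponding manual operations in open-source Iceberg are shown below. Ryft triggers equivalent rewrites automatically; the catalog and table names are placeholders:

    from pyspark.sql import SparkSession

    # Assumes a Spark session with an Iceberg catalog configured.
    spark = SparkSession.builder.getOrCreate()

    # Fold accumulated deletes into the data: rewrite any data file that is
    # referenced by two or more delete files (placeholder names).
    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'analytics.events',
            options => map('delete-file-threshold', '2')
        )
    """)

    # Consolidate small position-delete files into fewer, larger ones.
    spark.sql("""
        CALL my_catalog.system.rewrite_position_delete_files(
            table => 'analytics.events'
        )
    """)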

Resource Optimization

Compaction jobs run on optimized compute and automatically allocate CPU and memory based on table size and workload characteristics. Data that is already optimized is not rewritten, and unused data is deprioritized for compaction to reduce resource usage. Compaction is optimized for performance and cost and runs in parallel, adapting to the number of tables involved.

Workload-Aware Compaction

Ryft optimizes compaction differently depending on the ingestion and access pattern, differentiating between the following workload types (a hypothetical tuning sketch follows the list):
  • Batch
  • Streaming
  • Change Data Capture (CDC)
  • Delete-intensive workloads
  • Hybrid
  • Full table rewrites
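As a purely hypothetical illustration of what workload-aware tuning can mean in practice - the profiles and values below are assumptions, not Ryft’s actual policy:

    # Hypothetical profiles, for illustration only: different ingestion
    # patterns tend to call for different rewrite strategies and cadences.
    WORKLOAD_PROFILES = {
        "batch":            {"strategy": "binpack", "cadence": "after each batch load"},
        "streaming":        {"strategy": "binpack", "cadence": "hourly",
                             "min_input_files": 5},
        "cdc":              {"strategy": "binpack", "cadence": "hourly",
                             "delete_file_threshold": 2},
        "delete_intensive": {"strategy": "binpack",
                             "rewrite_position_deletes": True},
        "hybrid":           {"strategy": "sort", "cadence": "daily"},
        "full_rewrite":     {"strategy": "sort", "rewrite_all": True},
    }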