Destination Improvement: Native Compaction and Layout Optimization for Iceberg Destinations
Answered
Requesting the implementation of asynchronous Data File Compaction (Bin-packing) and Delete File Merging for Iceberg destinations to align with Apache Iceberg standard maintenance capabilities (e.g., Spark rewriteDataFiles).
Currently, Fivetran’s Iceberg maintenance is limited to "removal of dead files" (Snapshot Expiration). While this manages metadata growth, it does not address the physical storage degradation inherent in high-frequency CDC (Change Data Capture) or Merge-on-Read workloads. This directly affected our Fivetran ingestion during a recent incident in which a full Historical Resync was required: the number of file reads needed was exceptionally high compared to what a well-maintained Iceberg catalog would have required.
Our production tables are experiencing:
- Small File Bloat: High-frequency merges result in thousands of small Parquet files, significantly increasing S3 GET request costs and slowing down query planning in Athena/Snowflake.
- Delete File Fragmentation: The accumulation of "Equality Delete" files creates a massive performance penalty at read-time, as engines must merge these deletes in-memory for every query.
- Sub-optimal Compression: Without a rewrite phase, data is not consistently compressed using modern codecs like ZSTD, leading to inflated S3 storage costs.
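To illustrate the read-time penalty from equality deletes, here is a minimal Python sketch of the merge a reader has to perform. It is a simplification and not any engine's actual code: the function name `apply_equality_deletes`, the tuple layout, and the in-memory row dictionaries are all hypothetical. The one rule it relies on comes from the Iceberg table spec: an equality delete applies only to data files with a lower data sequence number than the delete file.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
# Hypothetical in-memory stand-ins: (sequence_number, rows)
DataFile = Tuple[int, List[Row]]
DeleteFile = Tuple[int, List[Row]]


def apply_equality_deletes(
    data_files: Sequence[DataFile],
    delete_files: Sequence[DeleteFile],
    key_columns: Sequence[str],
) -> List[Row]:
    """Return the rows that survive after merging equality deletes at read time."""
    live_rows: List[Row] = []
    for data_seq, rows in data_files:
        # Gather the deleted keys from every delete file newer than this data file.
        # This work is repeated for each data file on every query, which is the
        # cost that grows as delete files accumulate.
        deleted_keys = set()
        for delete_seq, deletes in delete_files:
            if delete_seq > data_seq:
                for d in deletes:
                    deleted_keys.add(tuple(d[c] for c in key_columns))
        for row in rows:
            if tuple(row[c] for c in key_columns) not in deleted_keys:
                live_rows.append(row)
    return live_rows
```

For example, a delete for `id = 1` written at sequence number 2 removes the `id = 1` row written at sequence number 1 but leaves a newer `id = 1` row written at sequence number 3 untouched. Rewriting deletes into the base data files removes this per-query merge entirely.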
Requested Functionality
We request that Fivetran introduces a "Maintenance Window" or "Post-Sync Optimization" feature that performs the following:
- Bin-Packing (Compaction): Consolidate small data files into optimized target sizes (e.g., 128MB or 512MB) to improve I/O efficiency.
- Delete File Rewriting: Periodically merge "Delete Files" into the base data files to remove the "read-merge" penalty.
- Codec Enforcement: Ability to specify a table-wide compression codec (e.g., ZSTD) that is applied during the compaction rewrite.
- Sort Order Awareness: Support for "Locally Ordered" writes or Clustered distribution to improve partition pruning.
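As a rough illustration of the bin-packing step, the following Python sketch groups small files into rewrite batches that stay at or under a target size. It uses a generic first-fit-decreasing heuristic, which is an assumption made for illustration and not necessarily how Iceberg's own rewrite planner or any Fivetran implementation would work. The function name `plan_bin_pack` and the use of plain integers for file sizes are hypothetical.

```python
from typing import List, Sequence


def plan_bin_pack(file_sizes_mb: Sequence[int], target_size_mb: int) -> List[List[int]]:
    """Group small files into batches whose combined size fits the target.

    Files already at or above the target are left alone. Batches that would
    contain a single file are dropped, since rewriting them gains nothing.
    """
    bins: List[List[int]] = []   # file sizes assigned to each batch
    totals: List[int] = []       # running total size of each batch

    # First-fit decreasing: place the largest files first, each into the
    # first batch that still has room for it.
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_size_mb:
            continue  # already large enough, no rewrite needed
        for i, total in enumerate(totals):
            if total + size <= target_size_mb:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            bins.append([size])
            totals.append(size)

    return [batch for batch in bins if len(batch) > 1]
```

With a 128MB target, files of 130, 60, 50, 40, 10 and 5 MB would yield one rewrite batch of 60 + 50 + 10 + 5 = 125 MB, while the 130MB file is skipped and the lone 40MB file is left for a later pass.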
Business Value / ROI
- Reduced AWS/Query Costs: Lower S3 storage footprint via better compression and significantly lower Athena/Snowflake "Data Scanned" costs.
- Improved Query Latency: Faster BI dashboard performance for end-users due to optimized file counts.
- First-Class Iceberg Support: Bringing Fivetran's managed service in line with the performance expectations of the Apache Iceberg ecosystem (Spark, Trino, and Flink).
Official comment
Thanks for submitting this feature request.
We are actively investigating this product feature area. File size optimization and data sorting are key capabilities for driving lower storage costs and faster queries, as you mentioned above, and this is an area that we are continually looking to improve.
-Casey