Partitioning on fivetran synced column while syncing the data to Iceberg from MySQL
Answered
Hi Team,
We use Fivetran to sync data from MySQL binlogs to Iceberg/S3.
Our request is to introduce partitioning on the fivetran_synced date column (for the relevant tables, based on user requirements). This would allow our Spark jobs to leverage partition pruning and read the data more efficiently, resulting in improved performance and optimized resource utilization.
As I understand it, the Fivetran team periodically runs compaction jobs to remove orphan files and consolidate multiple small files into larger ones for improved storage and query efficiency. That same job could also be used to partition the data where required.
Please let us know if this can be supported or if there are any considerations we should be aware of.
Thanks,
Official comment
Thanks Megha for the request.
We have been looking at read-side improvements, including partitioning. We are improving how we sort data during Fivetran syncs so that it is better ordered by the fivetran_synced value. Query engines like Spark should then be able to use Parquet min/max values to prune at the file level. The intent is to make these changes transparent so that the overall management of your lake stays seamless.
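As a rough illustration of why sort order matters for file-level pruning, here is a toy model in plain Python. It is only a sketch under stated assumptions: `FileStats`, `build_files`, and `files_to_read` are hypothetical names, the "files" are just fixed-size chunks of timestamps, and this is not how Fivetran, Iceberg, or Spark implement pruning internally. It only shows that a reader comparing a filter range against per-file min/max values can skip far more files when rows are written in fivetran_synced order.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileStats:
    """Hypothetical stand-in for the min/max statistics kept per data file."""
    path: str
    min_synced: datetime
    max_synced: datetime


def build_files(timestamps, rows_per_file):
    """Split rows into fixed-size 'files' and record min/max per file."""
    files = []
    for i in range(0, len(timestamps), rows_per_file):
        chunk = timestamps[i:i + rows_per_file]
        name = f"file_{i // rows_per_file}.parquet"
        files.append(FileStats(name, min(chunk), max(chunk)))
    return files


def files_to_read(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper)."""
    return [f for f in files if f.max_synced >= lower and f.min_synced < upper]


start = datetime(2024, 1, 1)
# 240 hourly sync timestamps covering 10 days.
hourly = [start + timedelta(hours=h) for h in range(240)]

# Sorted layout: each file holds one contiguous day of rows.
sorted_files = build_files(hourly, rows_per_file=24)

# Unsorted layout: rows are interleaved, so every file spans almost all 10 days.
interleaved = [t for k in range(10) for t in hourly[k::10]]
unsorted_files = build_files(interleaved, rows_per_file=24)

# Filter: rows synced on 2024-01-03.
lower, upper = datetime(2024, 1, 3), datetime(2024, 1, 4)
sorted_hits = files_to_read(sorted_files, lower, upper)
unsorted_hits = files_to_read(unsorted_files, lower, upper)

print(len(sorted_hits), len(unsorted_hits))  # prints "1 10"
```

In the sorted layout only one of the ten files overlaps the one-day filter, while in the interleaved layout all ten must be read, even though both layouts hold exactly the same rows.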
Do you have data to show read side impact of a lack of partitioning?
If so, can we take the conversation offline? My email is casey.karst@fivetran.com.
- Casey