Optional Date-Based Partitioning for Connector File Outputs
Answered
Currently, Fivetran stores ingested files in a delta-style structure, which works well for replication purposes. However, it would be extremely valuable to have an optional setting across connectors to organize files using date-based partitioning.
For example, instead of only storing files in a delta format, allow customers to define a structure such as:
source/year/month/day/files.parquet
or
source=<source_name>/year=/month=/day=/files.parquet
This would provide major benefits for downstream processing and analytics platforms such as Spark, Athena, Redshift Spectrum, Snowflake, Databricks, and dbt workflows, improving:
- Query performance
- Partition pruning
- Lifecycle management
- Cost optimization
- Data lake organization standards
Ideally, this could be an optional configuration at the connector or destination level.
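To make the two layouts above concrete, here is a minimal Python sketch of how such a path could be derived from a timestamp. The function name `partition_path`, its parameters, and the choice of UTC are assumptions for illustration only; this is not an existing Fivetran setting or API.

```python
from datetime import datetime, timezone


def partition_path(source: str, ts: datetime, filename: str,
                   hive_style: bool = True) -> str:
    """Build a date-partitioned object key for one output file.

    Hypothetical helper: it only illustrates the two layouts in the request.
    `ts` should be timezone-aware; it is normalized to UTC so that every
    writer agrees on which day a file belongs to.
    """
    ts = ts.astimezone(timezone.utc)
    year, month, day = f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}"
    if hive_style:
        # key=value folders, which Hive-compatible engines read as partition columns
        parts = [f"source={source}", f"year={year}", f"month={month}", f"day={day}"]
    else:
        # plain positional folders: source/year/month/day
        parts = [source, year, month, day]
    return "/".join(parts + [filename])


if __name__ == "__main__":
    synced = datetime(2024, 3, 5, 12, 30, tzinfo=timezone.utc)
    print(partition_path("salesforce", synced, "files.parquet"))
    print(partition_path("salesforce", synced, "files.parquet", hive_style=False))
```

The key=value form is usually the more convenient of the two, since engines that follow the Hive partitioning convention can discover year, month, and day as columns without extra configuration.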
Official comment
Thanks, Henrique, for the request.
In your thinking, would you want the time-based partitioning to be based on the fivetran_sync time or on a user-defined time value in the dataset?
For most of our workloads, we have observed a natural ordering of data based on fivetran_sync time. As a result, the Parquet min/max statistics that get generated can be used by query engines to prune files. Curious whether you are seeing any of this with your query engine.
-Casey
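To illustrate the min/max pruning mentioned in the comment above, here is a minimal Python sketch. The names `FileStats`, `min_synced`, `max_synced`, and `prune_files` are hypothetical; real query engines read these statistics from Parquet file metadata rather than from a hand-built list, so treat this as a model of the idea and not of any engine's implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary of the sync-time column."""
    path: str
    min_synced: datetime
    max_synced: datetime


def prune_files(files, start, end):
    """Return the files that may contain rows with start <= synced < end.

    A file is skipped when its [min_synced, max_synced] range cannot
    overlap the requested window, so it never has to be opened.
    """
    return [f for f in files if f.max_synced >= start and f.min_synced < end]


if __name__ == "__main__":
    files = [
        FileStats("a.parquet", datetime(2024, 3, 1), datetime(2024, 3, 2)),
        FileStats("b.parquet", datetime(2024, 3, 3), datetime(2024, 3, 4)),
        FileStats("c.parquet", datetime(2024, 3, 5), datetime(2024, 3, 6)),
    ]
    kept = prune_files(files, datetime(2024, 3, 3), datetime(2024, 3, 5))
    print([f.path for f in kept])
```

This only works well when files are naturally ordered by sync time, as described above. If each file spans a wide range of sync times, the ranges overlap most query windows and few files can be skipped, which is the case where explicit date partitions would help more.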