Destination Improvement: Liquid Clustering support for Databricks Delta table destination

Comments

2 comments

    Resolving this bug is critical for enterprise customers. Workarounds for it require needless full copies of the source tables so that "indexes" (of sorts) can be added. Without clustering, JOINs can run ridiculously long and waste so much compute that adding clustering to those tables is the only way to complete the processing within our timeline requirements. As noted by Terence, the clustering detail should be safe for the Fivetran update process to ignore entirely, since it's local metadata that Spark / Databricks leverages internally.

    Adding to this: we sync the majority of our data into Databricks via Fivetran (Microservices), and the raw data is used by multiple teams across the organization. Some of these sources have very large tables that are joined and filtered multiple times depending on the use case, and those are some of our most expensive operations.

    Databricks recently released Auto Liquid Clustering, which is designed to optimize query performance by dynamically clustering based on query patterns. Our account team recommended that we try it to optimize our query performance and manage resource utilization.

    Currently the only workaround is to replicate the data that Fivetran has already synced into another set of tables, which is both wasteful and expensive, especially for very large tables. Given this new development, and how long liquid clustering has been GA in Databricks, I think it would be very helpful if Fivetran would consider supporting this feature.