Destination Improvement: Liquid Clustering support for Databricks Delta table destination
Our team recently started using Liquid Clustering on Databricks, and while trying to sync those tables we're getting errors related to the "Clustering Information" column.
The issue is basically as follows:
Entries beginning with a hash (#) in the table's DESCRIBE output are not actual column names of the table; they are sub-heading names that group other table details. # Clustering Information indicates that the rows below it describe the columns involved in clustering, # Partition Information indicates that the rows below it describe the partition columns, and # Detailed Table Information introduces other details about the table. This is expected behavior, and these rows are deliberately differentiated by the '#' at the beginning.
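To illustrate, here is a minimal sketch (assuming a PySpark session and a placeholder table name) of how those '#' rows could be skipped when reading the DESCRIBE output; everything from the first blank or '#' row onward is table metadata, not column data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE returns rows of (col_name, data_type, comment); the real columns
# come first, followed by sections such as "# Clustering Information",
# "# Partition Information" and "# Detailed Table Information".
rows = spark.sql("DESCRIBE TABLE my_catalog.my_schema.my_table").collect()

actual_columns = []
for row in rows:
    name = row["col_name"].strip()
    if not name or name.startswith("#"):
        break  # blank separator or section heading: no more real columns
    actual_columns.append((name, row["data_type"]))

print(actual_columns)
```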
We would appreciate it if Fivetran could prioritize addressing this issue for the Delta Table destination. Thank you!
-
Resolving this bug is critical for enterprise customers. Workarounds require needless full copies of source tables so that "indexes" (of sorts) can be added to the copies. Without clustering, JOINs can run for an unreasonably long time and are so wasteful that clustering those tables is the only way to perform the required processing efficiently within our timeline requirements. As noted by Terence, the clustering detail should simply be ignored as irrelevant to the Fivetran update process, since it is local metadata for Spark / Databricks to leverage internally.
-
Adding to this. We sync the majority of our data into Databricks via Fivetran (Microservices), and the raw data is used by multiple teams across the organization. Some of these sources have very large tables that are joined and filtered multiple times depending on the use case, and they account for some of our most expensive operations.
Databricks recently released Auto Liquid Clustering, which is designed to optimize query performance by dynamically clustering data based on query patterns, and our account team recommended we try it to improve query performance and manage resource utilization.
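For reference, enabling liquid clustering on an existing table is a one-line DDL change. The sketch below (placeholder catalog/table/column names, run from a Databricks notebook session) shows both manual keys and the automatic mode; please verify the exact syntax against the Databricks docs for your runtime version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Manual liquid clustering on explicitly chosen keys
spark.sql("""
    ALTER TABLE my_catalog.raw.orders
    CLUSTER BY (customer_id, order_date)
""")

# Automatic liquid clustering: Databricks selects keys from query patterns
spark.sql("ALTER TABLE my_catalog.raw.orders CLUSTER BY AUTO")

# Incrementally recluster existing data according to the clustering keys
spark.sql("OPTIMIZE my_catalog.raw.orders")
```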
Currently, the only workaround is to re-replicate data that Fivetran has already synced into another set of tables, which is both wasteful and expensive, especially for very large tables. Given this new development and how long liquid clustering has been GA on Databricks, it would be very helpful if Fivetran would consider supporting this feature.