Destination Improvement: Optimize the queries written by HVR to load data to Google BigQuery
PlannedWe have some tables which has high volume of data and we were planning to do the partition of table so that complete table is not scanned during data load.
We analyzed the queries executed by Fivetran in BigQuery (e.g., DELETE, MERGE, UPDATE, etc.) and found that there was no significant column from the main table that could be used for partitioning.
However, we identified one potential candidate for partitioning the "sap_arch_deleted" column. This column is used in some of the queries generated by Fivetran, so we thought it could be effective for partitioning.
Upon further inspection, we found that this filter is applied in a subquery (as part of an EXISTS function), which prevents it from driving partitioning behavior on the main table.
Generated query from Fivetran:
delete from `zp-data-platform-stg.Quantiphi_UT.BSEG_4 bas_ where exists (select 1 from `sap-replication-prod.RAW_ECP.bseg__b` bur_ where (bas_.`MANDT`=bur_.`MANDT` or (bas_.`MANDT` is null and bur_.`MANDT` is null)) and (bas_.`BUKRS`=bur_.`BUKRS` or (bas_.`BUKRS` is null and bur_.`BUKRS` is null)) and (bas_.`BELNR`=bur_.`BELNR` or (bas_.`BELNR` is null and bur_.`BELNR` is null)) and (bas_.`GJAHR`=bur_.`GJAHR` or (bas_.`GJAHR` is null and bur_.`GJAHR` is null)) and (bas_.`BUZEI`=bur_.`BUZEI` or (bas_.`BUZEI` is null and bur_.`BUZEI` is null)) and bas_.`sap_arch_deleted`<>0 and bur_.`hvr_op`=1);
If we rewrite the query as below then partitions can be leveraged.
delete from `zp-data-platform-stg.Quantiphi_UT.BSEG_4` bas_
where
bas_.`sap_arch_deleted`<>0
AND exists (select 1 from `sap-replication-prod.RAW_ECP.bseg__b` bur_ where (bas_.`MANDT`=bur_.`MANDT` or (bas_.`MANDT` is null and bur_.`MANDT` is null)) and (bas_.`BUKRS`=bur_.`BUKRS` or (bas_.`BUKRS` is null and bur_.`BUKRS` is null)) and (bas_.`BELNR`=bur_.`BELNR` or (bas_.`BELNR` is null and bur_.`BELNR` is null)) and (bas_.`GJAHR`=bur_.`GJAHR` or (bas_.`GJAHR` is null and bur_.`GJAHR` is null)) and (bas_.`BUZEI`=bur_.`BUZEI` or (bas_.`BUZEI` is null and bur_.`BUZEI` is null)) and bur_.`hvr_op`=1);
-
Hi Samyak,
There is an argument that the BigQuery query optimizer should be smart enough to recognize the ability to use partitioning with the way HVR generated this query.
However, I also think that it would generally be cleaner if we uplifted the restriction on the driving table out of the sub query.
Let me discuss with engineering whether we can do this easily.
Thanks,
Mark. -
FYI Samyak engineering made code changes that are currently in QA. I look forward to releasing this improvement soon.
Thanks,
Mark. -
Hi Samyak,
We have made progress with some engineering work, and it would be probably good to plan some time to discuss those. Let me know when you are open to that.
Best regards,
Edwin
Please sign in to leave a comment.
Comments
3 comments