Destination Improvement: Optimize the queries written by HVR to load data to Google BigQuery
We have some tables which has high volume of data and we were planning to do the partition of table so that complete table is not scanned during data load.
We analyzed the queries executed by Fivetran in BigQuery (e.g., DELETE, MERGE, UPDATE, etc.) and found that there was no significant column from the main table that could be used for partitioning.
However, we identified one potential candidate for partitioning the "sap_arch_deleted" column. This column is used in some of the queries generated by Fivetran, so we thought it could be effective for partitioning.
Upon further inspection, we found that this filter is applied in a subquery (as part of an EXISTS function), which prevents it from driving partitioning behavior on the main table.
Generated query from Fivetran:
delete from `zp-data-platform-stg.Quantiphi_UT.BSEG_4 bas_ where exists (select 1 from `sap-replication-prod.RAW_ECP.bseg__b` bur_ where (bas_.`MANDT`=bur_.`MANDT` or (bas_.`MANDT` is null and bur_.`MANDT` is null)) and (bas_.`BUKRS`=bur_.`BUKRS` or (bas_.`BUKRS` is null and bur_.`BUKRS` is null)) and (bas_.`BELNR`=bur_.`BELNR` or (bas_.`BELNR` is null and bur_.`BELNR` is null)) and (bas_.`GJAHR`=bur_.`GJAHR` or (bas_.`GJAHR` is null and bur_.`GJAHR` is null)) and (bas_.`BUZEI`=bur_.`BUZEI` or (bas_.`BUZEI` is null and bur_.`BUZEI` is null)) and bas_.`sap_arch_deleted`<>0 and bur_.`hvr_op`=1);
If we rewrite the query as below then partitions can be leveraged.
delete from `zp-data-platform-stg.Quantiphi_UT.BSEG_4` bas_
where
bas_.`sap_arch_deleted`<>0
AND exists (select 1 from `sap-replication-prod.RAW_ECP.bseg__b` bur_ where (bas_.`MANDT`=bur_.`MANDT` or (bas_.`MANDT` is null and bur_.`MANDT` is null)) and (bas_.`BUKRS`=bur_.`BUKRS` or (bas_.`BUKRS` is null and bur_.`BUKRS` is null)) and (bas_.`BELNR`=bur_.`BELNR` or (bas_.`BELNR` is null and bur_.`BELNR` is null)) and (bas_.`GJAHR`=bur_.`GJAHR` or (bas_.`GJAHR` is null and bur_.`GJAHR` is null)) and (bas_.`BUZEI`=bur_.`BUZEI` or (bas_.`BUZEI` is null and bur_.`BUZEI` is null)) and bur_.`hvr_op`=1);
-
Hi Samyak,
There is an argument that the BigQuery query optimizer should be smart enough to recognize the ability to use partitioning with the way HVR generated this query.
However, I also think that it would generally be cleaner if we uplifted the restriction on the driving table out of the sub query.
Let me discuss with engineering whether we can do this easily.
Thanks,
Mark.
Please sign in to leave a comment.
Comments
1 comment