Destination Improvement: COPY INTO for Databricks
HVR 5 used the following steps:
Source -> temp ADLS -> Target Table
The second step did not use a burst table; it used COPY INTO directly.
HVR 6 does:
Source -> temp ADLS -> burst table -> target table
This leads to multiple problems:
1. Lower performance: in an append-only scenario the burst table is not required.
2. COPY INTO is idempotent while the burst step is not, which leads to duplicates that can affect the repair feature.
3. The burst table changes the order of the records, since it handles changes per type and based on time.
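To illustrate point 2, here is a minimal sketch in Python of why a retry is harmless with an idempotent load and harmful with a plain append. It is a toy model only: `ToyTable`, `copy_into`, `plain_append` and the file name are hypothetical and are not the Databricks or HVR API. It assumes the idempotent load remembers which staged files it has already loaded, which is how COPY INTO skips files on a re-run.

```python
# Toy model only: not the Databricks or HVR API.

class ToyTable:
    def __init__(self):
        self.rows = []
        self.loaded_files = set()  # load history, analogous to what COPY INTO keeps

    def copy_into(self, file_name, file_rows):
        """Idempotent load: a file that was already loaded is skipped."""
        if file_name in self.loaded_files:
            return 0
        self.rows.extend(file_rows)
        self.loaded_files.add(file_name)
        return len(file_rows)

    def plain_append(self, file_rows):
        """Non-idempotent load: every call appends, even on a retry."""
        self.rows.extend(file_rows)
        return len(file_rows)


# Hypothetical staged file with two change records.
staged = {"batch_001.csv": [("k1", "insert"), ("k2", "insert")]}

idempotent = ToyTable()
naive = ToyTable()
for _attempt in range(2):  # the second pass simulates a retry after a failure
    for name, rows in staged.items():
        idempotent.copy_into(name, rows)
        naive.plain_append(rows)

print(len(idempotent.rows))  # 2: the retry loaded nothing new
print(len(naive.rows))       # 4: the retry duplicated both rows
```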
-
Ankit,
Thank you for submitting this request.
Integration into LDP's targets is uniform, i.e. it works the same for all targets. The behavior in HVR 5 was different because of the use of the AgentPlugin for delivering into Databricks. With version 6.1 natively supporting Databricks, the delivery is now identical to other platforms.
As it relates to idempotent behavior: Databricks does not currently support multi-statement transactions. This means that our default desire to keep a transactionally consistent image on the target cannot be maintained. However, individual statements, such as the COPY INTO command, are still idempotent. I recommend you set Integrate CommitFrequency to STATEMENT to inform LDP that this is the behavior. It should then recover properly without introducing duplicates.
Regarding order: the only reliable indicator of transaction order is the TimeKey column, which we recommend you populate with {hvr_integ_seq}.
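As a rough illustration of relying on that column for order, the Python sketch below sorts appended change rows by a sequence column and then derives the latest state per key. The column name `integ_seq`, the row layout and the sample values are all made up for this example; it assumes the sequence values are fixed-width strings that compare in apply order.

```python
# Hypothetical rows as they might sit in an append-only target,
# in arbitrary physical order.
rows = [
    {"integ_seq": "0003", "id": 1, "op": "update", "val": "c"},
    {"integ_seq": "0001", "id": 1, "op": "insert", "val": "a"},
    {"integ_seq": "0002", "id": 2, "op": "insert", "val": "b"},
]

# Recover the change order from the sequence column, not from row position.
ordered = sorted(rows, key=lambda r: r["integ_seq"])

# Replay in order so the last change per key wins.
latest = {}
for r in ordered:
    latest[r["id"]] = r

print([r["integ_seq"] for r in ordered])         # ['0001', '0002', '0003']
print({k: v["val"] for k, v in latest.items()})  # {1: 'c', 2: 'b'}
```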
I will consider your enhancement as a general improvement for all LDP targets and put it on the backlog.
Thanks,
Mark.
-
Hi,
Is there any update on this feature? The introduction of the MERGE statement has a high cost impact, around 30-40%.
Thanks
-
Hi Marco,
Thank you for your suggestion.
We are working on append optimizations generically and for other platforms. It looks like we may get to this in the 2nd half of calendar year 2024.
Thanks,
Mark.