Scenario
A frequently asked question is how to refresh the ‘bseg’ table with hundreds of millions of rows from an SAP source to a target database. A regular refresh job usually is not enough in this case, it often fails with different errors (timeout, snapshot too old, etc…) or simply takes too much time to finish.
The ’bseg’ table is extracted from ’rfblg’ table cluster, usually has hundreds of columns and contains 800-900 million rows. In this example we will use two of its columns to do a ’manual’ selective refresh.
gjahr - standard field within SAP Cluster table bseg that stores Fiscal Year information.
https://www.se80.co.uk/saptabfields/b/bseg/bseg-gjahr.htm
belnr - standard field within SAP Cluster table bseg that stores Document Number information.
https://www.se80.co.uk/saptabfields/b/bseg/bseg-belnr.htm
Prerequisites
Add the cluster table to your channel as per the below documentation:
Based on the two columns (gjahr, belnr) we can find out how many slices we need exactly to create. My recommendation is rather to have several refresh jobs, about 20-30 million rows of data for each.
Column ‘gjahr’ filters by years, so let’s suppose we have data from 10 years back, about 80 million rows for each year if we have about the same amount of data for every year (sometimes not, it can be 100-150 million for a year).
‘belnr’ is a key (number) column, we can use the modulo function to distribute our dataset into more pieces. On an SAP Oracle source this would look like: mod((belnr), x) = y -> x the number of slices; y will be 0,1,2,3,..x-1.
We can also use one column for our distributed refresh jobs, but using both is more efficient for an initial refresh, since that way we can slice the table into smaller chunks.
Adding Restrict /RefreshCondition action
We can define how we handle the refresh in Restrict /RefreshCondition option:
/RefreshCondition - During refresh, only rows where the condition evaluates as TRUE are affected.
We can apply these settings only to our SOURCE group or both SOURCE and TARGET groups by selecting Group ‘*’ or separately a different expression for each groups (based on the requirements).
In case of an initial refresh (target table is empty at the beginning)
The best solution is only to select and filter the data on the source side and just insert the results to target side.
/RefreshCondition: gjahr = {hvr_var_jahr} and mod((belnr), 5) = {hvr_var_belnrslice}
/Context: slicing
Meaning of this restriction: HVR will extract the following dataset from the source side after generating a refresh job with given values and selecting this context (slicing):
select * from schema.table_name where gjahr=2020 and mod((belnr),5)=0;
(you might need different syntax in case of other type of databases)
NOTE: Refresh job truncates the target table at the beginning of the refresh process by default, so if we run multiple refresh jobs on the same table, then each of them will truncate what the previous refresh job inserted
To avoid the refresh jobs' truncation, we need to define a Restrict action with a false expression to our TARGET group, so it will run a delete statement instead of truncate but it won’t delete anything:
delete from schema.table_name where 1=0;
In case we only want to refresh/replace a selected part of the table (target table is already loaded)
We want to refresh only 2020 data from source to target:
Give a simple statement, selecting both groups (*), then schedule a refresh job with applying the context. It will pull the data from source and replace the corresponding part of the target table.
Setting up the refresh jobs:
Options:
‘Only resilience’ option is needed if you select a context.
We can also create the table on target side in the first refresh job if it does not exist yet.
Scheduling:
Jobs can take more hours to complete, so scheduling classic jobs and starting them from the Scheduler is suggested.
Contexts:
In case of an initial refresh (target table is empty at the beginning)
Since we used hvr variables in the refresh condition:
/RefreshCondition = gjahr = {hvr_var_jahr} and mod((belnr), 5) = {hvr_var_belnrslice}
we have two variables for which we have to provide a value.
The above example will create a refresh job for the following dataset:
select * from schema.table_name where gjahr=2015 and mod((belnr),5)=0;
Of course we want to cover the whole table, so need to create more jobs with the rest of the variations:
gjahr=2015 && blnrslice=0
gjahr=2015 && blnrslice=1
gjahr=2015 && blnrslice=2
gjahr=2015 && blnrslice=3
gjahr=2015 && blnrslice=4
gjahr=2016 && blnrslice=0
(…)
In case we only want to refresh/replacea selected part of the table (target table is already loaded)
We gave the condition without hvr variables, therefore we only need to select the context to apply it, there is no need to provide any value under “Variables”:
This way the following condition will be applied to the source:
select from schema.source_table where gjahr=2020
to the target:
delete from schema.target_table where gjahr = 2020.
Run the jobs from Scheduler
To avoid locks in databases, one refresh job per table can be running at the same time.
We can set up ‘quota_run’ attribute in Scheduler and set its value to 1, then start all the refreshes. This setting will grant that only 1 job will be running at the same time, but the next will be started automatically.