The following acronyms are used throughout this post:
- PK = Primary Key
- RK = Replication Key
- DK = Distribution Key
1) Is a PK mandatory for HVR replication?
* A: No. When a channel is built, HVR will inspect the source database for a good "Replication Key" (RK). If it sees a PK then that is used for its RK. If there is no PK it will look for a unique index as RK. If there are multiple, it uses a certain rule to decide which to use as RK. If no unique key exists then HVR will define a RK using all non-LOB datatypes. This last case is called an "implicit replication key". HVR also allows the RK to be defined or overridden manually, so it can (for example) conitain different columns from the PK in the source. Note that HVR ignores a table's non-unique indexes. On Oracle it also ignores unique indexes if that index only contain a single nullable column (because Oracle has a weak grasp of 'uniqueness').
2) If a table does not have a PK, how dows HVR handle the updates and deletes?
As described above, all HVR tables have a RK. This key will typically match the PK if that existed. Normally HVR will do an UPDATE or a DELETE by putting the RK in the SQL WHERE clause. However, if the RK is "implicit" *(see above) and the HVR parameter /DuplicateRows is defined then the updates will be performed instead as a DELETE (with the RK in the WHERE clause) and then an INSERT (see next question for more details on /DuplicateRows).
3): What effect does the HVR parameter /DuplicateRows have?
The parameter /DuplicateRows only affects tables that have an "implicit key" (see above). It has no effect on tables with an explicit RK, so it can safely be defined with \[table='*'\] to only affect tables with implicit RKs. The parameter /DuplicateRows should be used if a table can actually contain multiple identical rows (for many tables with no PK or Unique Key in the DBMS dictionary the application will ensure duplicate rows never really happen). In this case, /DuplicateRows will change the way how updates are integrated, i.e., updates are applied as a delete and an insert instead. Additionally, the SQL that HVR uses for deletes is affected. Unnecessary use of /DuplicateRows can have performance costs, but a missing /DuplicateRows can lead to errors (e.g. "update affected more than X rows expected").
4) Does HVR take all the columns in a table to define a distribution key?
For a distributed DBMS like Greenplum (for example), HVR has a concept of the Distribution Key (DK) for each table. This are the column(s) that HVR uses for hashing the storage when HVR creates its tables in the target DB. The DK should be defined manually using HVR parameter /DistributionKey. Choosing a good DK for each table is a manual process which we are not involved in. If no explicit DK is defined then HVR just uses the first column of the RK. This default rule is weak normally, and terrible for a SAP table, because the first column of the RK is typically "mandt" which is constant!
5) Does HVR create the target table in a distributed DBMS itself, and if so, how does it know which distribution key to use?
If Integrate /Burst is defined, HVR is used to create the DBMS target table, but they can also be created directly. When HVR creates the DBMS target table, it will use the DK as a "distribution key" for the distributed DBMS. Additionally, the table will have HVR's RK as index; either with or without uniqueness, depending on whether HVR parameter /DuplicateKeys is defined. HVR also needs a 'temp' table for delivering bursts of changes into its target tables. These are called __b. HVR always creates these itself. They don't get an index (not needed), but HVR will use the same DK for these as the target table. It is essential that the DK of this burst table matches the DK of the target table, otherwise HVR's burst-updates will be mis-aligned, which will either fail or be too slow. So if the DK of a target table is changed, the __b burst table must also be dropped and recreated with the new DK.