Overview
HVR’s recommended architecture (see below) features HVR agents – installations of the software that play the role of an agent – and the HVR hub. The HVR hub is the installation that controls data integration for one or more data flows (channels). The hub includes a scheduler that manages the jobs to keep data in sync. In order to operate, the hub’s scheduler must connect to a repository database.
Illustration 1: HVR’s recommended architecture using HVR agents
A typical setup includes many HVR agents, often running on or very close to every source system, and on or very close to every target system.
HVR’s software is modular and flexible. Any installation of the software can play the role of an HVR agent and/or the HVR hub. Some customers run the hub standalone, while others combine the hub role with source change data capture and/or target data delivery.
The two main considerations in deciding where to run the hub are:
- resource consumption – related to sizing – and,
- availability, given no data is moving if the hub’s scheduler doesn’t run.
Local Processing Hub – Resource Consumption
HVR is designed to distribute work as much as possible. As a result, relatively resource-intensive processing is pushed to the HVR agent to perform, with the HVR hub performing as little processing as possible. The hub is, however, in charge of all the jobs that move data between sources and targets, and stores system state to enable recovery without any loss of changes. At present all data flowing between sources and targets passes through the HVR hub, including data from a one-time load (refresh) and a detailed row-wise comparison.
The HVR hub needs resources to:
- Run the scheduler.
- Spawn jobs to perform one-time data movement (Refresh and Compare) and continuous replication (CDC and integration). In all cases the resource-intensive part of the data processing is pushed to an HVR agent to perform, including data compression, with the hub simply passing the data from source to target. For a data refresh or compare the data is simply passed through without touching disk. During normal CDC activity data is temporarily stored on disk to allow the quickest possible recovery, with capture(s) and integrate(s) running asynchronously for optimum efficiency. If data transfer is encrypted then the HVR hub decrypts the data, and encrypts it again (typically using different encryption certificates) as needed to deliver it to the target.
- Pass through compressed data from source to target. Thanks to strong compression (5-10x compression on the actual data size is common), large amounts of data can be passed through without the need for very high network bandwidth.
- Gather metrics from the log files to be stored in the repository database.
- Serve real-time process metrics to any Graphical User Interfaces (GUIs) connected to the hub. HVR runs as a service irrespective of any GUI connected, and real-time metrics are provided for monitoring purposes.
- Allow modifications to the setup in the design environment.
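As a rough illustration of the compression point above, the following Python sketch converts an uncompressed data rate into the average network bandwidth needed once the data is compressed. The function name, the 50 GB/hour rate and the use of decimal units (1 GB = 8,000 megabits) are assumptions made for illustration; only the 5x and 10x ratios come from the range quoted above.

```python
def required_bandwidth_mbps(raw_gb_per_hour: float, compression_ratio: float) -> float:
    """Return the average network bandwidth (Mbit/s) needed for compressed data."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    compressed_gb_per_hour = raw_gb_per_hour / compression_ratio
    megabits_per_hour = compressed_gb_per_hour * 8000  # 1 GB = 8,000 Mbit (decimal)
    return megabits_per_hour / 3600  # 3,600 seconds per hour


# Hypothetical workload: 50 GB/hour of uncompressed data.
for ratio in (1, 5, 10):
    print(f"{ratio}x compression: {required_bandwidth_mbps(50, ratio):.1f} Mbit/s")
```

Under these assumptions, 50 GB/hour needs about 111 Mbit/s uncompressed but only about 11 to 22 Mbit/s at 5-10x compression. These are averages over an hour; short peaks can be considerably higher.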
The most important resource for the HVR hub to function well is fast IO (in terms of IOPS, IOs Per Second), especially for the HVR_CONFIG location where runtime data is written and state is kept. To support CDC on a busy source system, transaction files can be written to disk every second or two, with updates to the (tiny) capture state file at the same rate, as well as very frequent updates to the log files that keep track of the activity. With most customers running multiple channels, there will be many small IOs into the HVR_CONFIG directory every second. A disk subsystem with a sizeable cache and preferably Solid-State Drives (SSDs) is a good choice for HVR hub storage.
Repository Database
The HVR hub stores metadata, a very limited amount of runtime data, and aggregated process metrics (statistics) in its repository database. The most important resource for the hub database is storage, and the needs are modest: up to 5 GB of database storage allocated to the repository database can support virtually all hub setups. Traditionally customers stored their repository database local to the HVR hub, but more recently customers frequently use a database service to host the repository database away from the HVR hub. The main advantage of a local repository database is a lower likelihood that the database connection fails, which would stop all data flows because the scheduler fails in that case. The main advantage of a database hosted elsewhere is that it offloads the resources the repository requires from the hub.
High-Volume Agent (HVA) – Resource Consumption
With the HVR hub in charge of data movement between sources and targets, the HVR agents perform the hard work.
Capture Agent
The HVR capture agent needs resources to perform the following functions:
- For one-time data loads (refreshes) and row-wise compares, retrieve data from the source database, compress it, optionally encrypt it and send it to the hub. For optimum efficiency data does not touch disk during such operations. Of course, the matching database session(s) that serve up the data may use a fair amount of database (and with that system) resources. Resource consumption for refreshes and compares is only intermittent.
- For bulk compare jobs the HVR Agent computes a checksum for all the data.
- To set up CDC, during Initialize, retrieve metadata from the database, and add table-level supplemental logging as needed.
- During CDC resources are needed to read the logs, parse them, and keep information about in-flight transactions in memory (until a threshold is reached and additional change data is written to disk). The amount of resources required for this task varies from one system to the other, depending on numerous factors, including:
◦ the log read method (direct or through an SQL interface),
◦ data storage for the logs (on disk or in, for example, Oracle Automatic Storage Management),
◦ whether the system is clustered or not,
◦ the number of tables in the replication and data types for columns in these tables, and
◦ the transaction mix (ratio of insert vs. updates vs. deletes, and whether there are many small, short-running transactions versus larger, longer-running transactions).
Parsing the log is generally the most CPU-intensive operation and can use up to 100% of a single CPU core when capture is running behind. HVR uses one log parser per database thread, and every database node in an Oracle cluster constitutes one thread.
For a real-world workload with the HVR Agent running on the source database server it is extremely rare to see more than 10% of total system resource utilization going to the HVR Agent during CDC, with typical resource consumption well below 5% of system resources.
For an Oracle source database HVR will periodically write the memory state to disk to limit the need to re-read archived log files to capture long-running transactions. Consider storage utilization for this if the system often processes large, long-running transactions.
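The log parser behaviour described above gives a simple worst-case bound on capture CPU: at most one core per database thread while capture is running behind. The Python sketch below expresses that bound as a share of the cores on the machine running capture. The function name and the 8-node, 64-core example are hypothetical, and the result is an upper bound for catch-up situations, not the typical utilization quoted above.

```python
def worst_case_parser_cpu_share(database_threads: int, host_cores: int) -> float:
    """Fraction of host CPU used if every log parser saturates one core."""
    if database_threads < 1 or host_cores < 1:
        raise ValueError("database_threads and host_cores must be at least 1")
    # One log parser per database thread, each bounded by a single core.
    busy_cores = min(database_threads, host_cores)
    return busy_cores / host_cores


# Hypothetical: capture for an 8-node Oracle cluster on a 64-core machine.
print(f"{worst_case_parser_cpu_share(8, 64):.1%}")  # prints 12.5%
```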
Integrate Agent
The HVR integrate agent needs resources to perform the following tasks:
- Apply data to the target system, both during a one-time load (refresh) and during continuous integration. The resource utilization for this task varies a lot from one system to the other, mostly depending on whether changes are applied in so-called burst mode or using continuous integration. Burst mode requires HVR to come up with a single net change per row per cycle so that a single batch insert, update or delete results in the correct end state for the row. For example, when a row is first inserted and then updated twice in a single cycle, the net operation is an insert with the two updates merged into the initial data from the insert. This so-called coalesce process is both CPU and (even more so) memory intensive, with HVR spilling data to disk if memory thresholds are exceeded.
- Some MPP databases like Teradata and Greenplum use a resource-intensive client utility (TPT and gpfdist respectively) to distribute the data directly to the nodes for maximum load performance. Whilst arguably resource consumption for these utilities is not directly attributed to the HVR agent you must consider the extra load when sizing the configuration.
- For data compare the HVR integrate agent retrieves the data from the target system to either compute a checksum (bulk compare) or to perform row-wise comparison. Depending on the technologies involved HVR may, in order to perform the row-wise comparison, sort the data which is memory intensive and will likely spill significant amounts of data to disk (up to the total data set size).
- Depending on the replication setup the HVR integrate agent may perform extra tasks like decoding SAP cluster and pool tables using the SAP Transform, or encrypting data using client-side AWS KMS encryption.
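The coalesce step used for burst mode, described in the list above, can be sketched in Python as follows. This is a minimal illustration of reducing an ordered stream of changes to one net change per row, not HVR's implementation. The tuple layout, the function name and the handling of a delete followed by an insert (modelled here as an update) are assumptions, and invalid sequences such as an update after a delete are not validated.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, int, dict]  # (operation, row key, column values)


def coalesce(changes: List[Change]) -> Dict[int, Tuple[str, dict]]:
    """Reduce an ordered list of changes to one net change per row key."""
    net: Dict[int, Tuple[str, dict]] = {}
    for op, key, values in changes:
        previous = net.get(key)
        if previous is None:
            # First change seen for this row in the cycle.
            net[key] = (op, dict(values))
            continue
        prev_op, prev_values = previous
        if op == "update":
            # Merge new column values into the pending insert or update.
            net[key] = (prev_op, {**prev_values, **values})
        elif op == "delete":
            if prev_op == "insert":
                # Row was created and removed within the cycle: no net change.
                del net[key]
            else:
                net[key] = ("delete", {})
        elif op == "insert":
            # Delete followed by insert: modelled as an update (assumption).
            net[key] = ("update", dict(values))
    return net


# The example from the text: one insert followed by two updates.
cycle = [
    ("insert", 1, {"id": 1, "name": "a", "qty": 1}),
    ("update", 1, {"qty": 2}),
    ("update", 1, {"name": "b"}),
]
print(coalesce(cycle))
```

For the cycle shown, the net result is a single insert for row 1 carrying the merged values (name "b", qty 2). The dictionary holding the pending net changes also hints at why the process is memory intensive: it grows with the number of distinct rows changed in a cycle.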
Many HVR customers consolidate data from multiple sources into one (or a few) target(s). With multiple, if not many, sources sending data to a target, a lot of data has to be delivered by a single HVR integrate agent. Customers have used load balancers (both physical and software-based like AWS’s Elastic Load Balancer (ELB)) to manage integration performance from many sources into a single target by scaling out the HVR integrate agents.
Local Processing Hub – Availability
In the HVR architecture the hub controls and initiates the replication, and stores the state. Without the hub the replication does not work. Based on a catalog export of the HVR repository it is straightforward and relatively quick to re-instantiate a hub, but of course latency builds up during the time the hub is not available, and you will have to manage data overlap between the HVR hub that is no longer available and the one that was restored (for which HVR provides capabilities).
HVR supports running the HVR hub in a cluster using shared storage, for example a Windows cluster or a Linux/Unix cluster based on Oracle Clusterware (as part of Oracle RAC), Red Hat Cluster or other general-purpose cluster management software. For automatic failover HVR must be able to re-establish a database connection upon failover, so the database must either be part of the failover and run on the same machine, or be remote. You may decide to co-locate the HVR hub with an already clustered database to take advantage of the high availability the cluster provides.
Sizing Guidelines for the Local Processing Hub
The most important factor impacting the HVR hub size is whether the hub also performs the role of a source and/or a target HVR agent. General recommendations include:
- Co-locate the hub with a production source database only if the server(s) hosting the production database has (have) sufficient available resources (CPU, memory, storage space and IO capacity) to support the HVR hub for your setup.
- HVR capture may run against a physical standby of the source database with no direct impact on the source production database. Mainly consider CPU utilization of the capture process(es) running on the source database in this case, knowing that for an Oracle RAC production database there will be one log parser per node in the source database, irrespective of the standby database configuration.
- Sorting data to coalesce changes for burst mode and to perform row-wise data compares (also part of a row-wise refresh) is CPU, memory and (temporary) storage space intensive.
- Utilities to populate the database target like TPT (for Teradata) and gpfdist (for Greenplum) can be very resource-intensive.
As the basis for sizing we use the number of channels. A channel is a concept HVR exposes that, for the purpose of this sizing document, is simplified to mean a single source and a single target. In this context multiple channels may be feeding into the same target, or sourcing from the same source. The change rate mentioned in the sizing guidelines below is the volume of transaction log changes produced by the database (irrespective of whether HVR captures all table changes from the source or only a subset).
Table: Rough Guidelines to Size the Hub
Review the guidelines and decide based on your situation what is the best hub configuration. For example:
- Your hub may capture changes for one of multiple sources, using agents for the other sources.
- One of your sources may be a heavily-loaded 8-node Oracle Exadata database that requires far more resources to perform CDC than a single mid-size SQL Server database.
- You may plan to run very frequent (resource intensive) compare jobs.
- Etc.
Test your assumptions if at all possible. Systems in the cloud can easily be changed in configuration to simplify your tests.