Overview
While HVR's high-performance replication and data validation solution is both reliable and fault-tolerant, as with any component of your mission-critical data infrastructure you will want to take the necessary steps to make it highly available (HA). This article describes how to mitigate and recover from downtime in your change data capture (CDC) pipelines when running in Amazon Web Services (AWS). While some prescriptions are provided and architectural considerations are mentioned, this article is not intended to be an exhaustive, step-by-step guide to designing HA or scalable replication architectures.
The primary area of focus for making HVR highly available is the HVR hub (explained below in the architecture overview section), which has both a file system and database component to consider. While there are numerous viable methods, products and services to make the HVR file system, database repository and networking highly available, this article focuses on HVR deployed in the Amazon Web Services (AWS) environment using native AWS features and services.
AWS Services Used When Making Local Data Processing Highly Available
- Amazon CloudWatch. Used to monitor the health of the servers and file systems on which HVR runs. This service is also used to initiate corrective actions.
- Amazon CloudWatch Agent. Allows the collection of additional metrics (such as free disk space) from Amazon EC2 instances (and on-premises servers) that are not otherwise available to the CloudWatch service.
- Amazon EBS. Elastic Block Store volumes are attached to the EC2 instances and contain the HVR installation, the operating system, and additional libraries.
- Amazon EC2. Elastic Compute Cloud instances are where the HVR software runs the HVR hub and optional remote stateless agents.
- Amazon Elastic IP. Used to remap an IP address to another instance in the event of a failover.
- Amazon Elastic Load Balancer. Distributes HVR traffic across multiple Amazon EC2 instances. Used for both the scalability and availability of the HVR stateless agents.
- Amazon Lambda. A compute service that lets you run code on demand without provisioning or managing servers.
- Amazon RDS. The HVR hub repository will be located in an Amazon RDS instance.
- Amazon Route 53. A Domain Name System (DNS) service used to route network traffic and perform DNS failover.
- Amazon S3. The storage service used to store and retrieve snapshots of EBS volumes.
- Amazon VPC. Virtual Private Clouds are used to logically isolate a section of AWS for security purposes.
Local Data Processing Architecture Overview
HVR requires a minimum of one installation designated as the HVR hub to enable the flow of changed data between any number of source and target data locations. Any given data pipeline, or channel, has one HVR hub, and this hub is the primary focus of the discussion on making HVR highly available.
Local Processing Hub
An HVR data channel can consist of one or many HVR installations, with one and only one of these installations designated as the hub for a given channel. The HVR hub can contain multiple channels, and a channel can serve as a data pipeline between two or more locations. The hub acts as the central command and control location for one or more data channels. It manages all aspects of replication, from configuration to deployment, data queuing to auto-recovery, and historical statistics recording to real-time alerting. The hub-only architecture of HVR behaves similarly to an as-a-service model, establishing remote capture connections to sources and remote integrate (apply changes) connections to targets directly from the hub.
The hub can be run directly on the source, the target, or an independent location. When installed in AWS, the hub and all locations typically reside in the same Virtual Private Cloud (VPC). Figure 1 shows the three possible architectures for the hub-only (agentless) configuration. In AWS deployments where the source and targets are in the same region, the first architecture is the most commonly deployed.
Figure 1. Options for the hub-only architecture.
The HVR hub stores information both on disk and in the backend database repository, or catalog. The database repository stores channel configurations, object metadata, replication status, and statistics. The disk stores runtime data including recovery state files for all locations and highly compressed queued data until those data are successfully delivered to all destinations.
Local Data Processing Stateless Capture and Integrate Agents
In enterprise hybrid-cloud and inter-regional environments, connecting data locations through the architecture of distributed HVR installations is common for superior throughput and security. In such scaled-out and highly secured environments, one HVR installation in a channel is designated as a hub, while all others are managed by the hub as stateless services or light agents. A single stateless agent can spawn multiple processes to be used for initial bulk loading, change data capture (CDC), data integration (apply), or as a secure data proxy, often set up in an on-premises DMZ (demilitarized zone). Communication between HVR installations is secured using TLS 1.2 with 256-bit AES encryption, and data are compressed, often in the 10-20-fold range, using custom compression designed for efficiency at high volumes. If stateless agents are not used, the hub will perform the capture and integrate duties as needed.
The stateless agents can be run directly on the source database server, the target database server, or from their own independent location. Having a stateless agent as close to the source and target systems as possible will help boost performance and optimize security. For example, for data architectures that span on-premises and cloud, multiple cloud vendors, or regions within the same cloud vendor, having a stateless agent in the same data center or availability zone will prove much more performant than making remote connections directly to the data locations from a distant location.
The agents should be in the same Virtual Private Cloud (VPC) as the hub and the location to which they connect. While having stateless agents on the source (usually for on-premises locations) or in the same region or availability zone provides tremendous advantages, this is not required.
Figure 2 shows three possible architectures with a hub (always required) and one or two agents. This rounds out all six configuration types for a single source to a single target replication channel.
Note that a single HVR hub can consist of channels that connect hundreds of source and target locations using any combination of agent and agentless configurations simultaneously, supporting a vast array of scalability and security needs.
Figure 2. Options for a hub with stateless agent architecture.
High Availability for Local Processing Hub
Because the HVR hub controls all aspects of replication, including the state of any remote HVR service, it is central to the HA discussion. Steps and considerations for making it highly available follow.
High Availability for Local Processing Hub Repository
The HVR hub database repository contains replication configurations, metadata for replicated objects, the status of replication jobs, performance statistics, and user activity audit history. It is not included in the HVR installation; rather, the hub repository database is installed independently of HVR, and HVR is then configured to connect to it. The database can be local or remote with respect to the hub server. Figure 3 shows the options for the hub repository placement.
Figure 3. Options for the hub repository placement
In cloud environments, the repository is often a managed service and therefore located remotely from the hub server. The repository database and hub server can be in different regions provided that the database response time is within 200-300 milliseconds. It is not recommended to have the hub server and repository database split between on-premises and the cloud or between two cloud vendors.
The hub repository can be almost any database that HVR supports, including but not limited to MySQL/MariaDB, PostgreSQL, SQL Server, and Oracle. For a list of supported databases that can be used for the HVR hub repository, see the section: Installing and Upgrading HVR.
For a high level of availability in AWS-deployed environments, managed database services such as Amazon RDS and Amazon Aurora are recommended. This reduces the additional hardware and administrative tasks needed to make the repository database highly available.
Figure 4 shows a typical Amazon RDS instance with a standby replica in the same region. RDS performance is such that the instance does not have to be in the same availability zone as the hub, and failover of the instance is automated by AWS and transparent to HVR.
Figure 4. HVR Repository HA when using Amazon RDS.
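As an illustrative sketch only (the engine, instance class, identifiers, and credentials below are placeholders, not recommendations), a Multi-AZ repository instance could be created with the AWS CLI:
# Illustrative only: create a Multi-AZ PostgreSQL instance for the HVR hub repository.
# Identifiers, sizes, and credentials are placeholders.
aws rds create-db-instance \
  --db-instance-identifier hvr-hub-repo \
  --engine postgres \
  --db-instance-class db.m5.large \
  --allocated-storage 100 \
  --multi-az \
  --master-username hvradmin \
  --master-user-password '<password>' \
  --vpc-security-group-ids sg-0123456789abcdef0
The --multi-az flag is what provisions the synchronous standby replica whose failover AWS automates.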
High Availability for Local Processing Hub Server
The HVR hub server maintains data on disk and communicates with the database repository as well as with replication locations. The hub communicates with these locations either directly or via stateless agents running as a service or daemon, and connecting to them sometimes requires third-party software (e.g., database clients or ODBC drivers). The areas that need to be made highly available for the hub server are:
- Disk volumes. Only one disk volume is required per HVR hub server, so a single Amazon Elastic Block Store (EBS) volume can accommodate all directories needed for the HVR hub. HVR works directly from three directories: HVR_HOME, HVR_CONFIG, and HVR_TMP. In addition to these directories, HVR needs access to third-party libraries as well as the core operating system libraries. The directories and files that will reside on this EBS volume are (a minimal mount and environment sketch follows this list):
  - HVR_HOME. Static HVR installation files on disk. The installation binary files and the HVR license file (hvr.lic) are contained here.
  - HVR_CONFIG. Dynamic data and other generated files. These include temporarily queued compressed data, runtime job files, process state and recovery information, and log files. HVR replication can always determine where it stopped if this directory and its files are available on the original or failover system.
  - HVR_TMP. If processing requires more RAM than HVR has been configured to use, the data will spill to disk in this directory. If this directory is not defined, HVR will create a sub-directory within the HVR_CONFIG directory structure.
  - Third-party connection libraries. These libraries are used by HVR to connect to replication locations as well as to the database repository. Examples: ODBC libraries, database clients, and security libraries.
  - OS library files. HVR installations are compiled in C and are sensitive to the OS type. Generally, there is only one build per OS, and that build can run on several major and minor versions of a given OS type.
- Network access. Regardless of whether the hub connects directly to locations (sources and targets) or connects through a remote HVR stateless service, these connections will be made using IP addresses or DNS names. Therefore, appropriate permissions must be set for network routing and security access. This is often addressed by Amazon Elastic IP and Elastic Load Balancers.
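As a minimal sketch of placing these directories on a dedicated EBS volume (the device name, mount point, and file system below are illustrative assumptions, not requirements):
# Illustrative only: format and mount an attached EBS volume, then point the
# HVR directories at it. Device name, mount point, and filesystem are assumptions.
sudo mkfs -t xfs /dev/xvdf
sudo mkdir -p /hvr
sudo mount /dev/xvdf /hvr
export HVR_HOME=/hvr/hvr_home
export HVR_CONFIG=/hvr/hvr_config
export HVR_TMP=/hvr/hvr_tmp
In a production setup, the mount would also be added to /etc/fstab so that it survives a reboot.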
Same Instance Recovery
Most failures do not require the failover of the HVR hub server. HVR's data queuing and state record keeping on the hub allow it to automatically handle source, target, and network outages, as well as killed processes and hard reboots. If a location becomes unavailable, by default HVR automatically retries the connection after 10 seconds and doubles the wait between retries until the wait time reaches 20 minutes. The wait time between retries is separately configurable for each location, and retries continue indefinitely. Once the connection is re-established, based on the saved state information on the hub and in the target location, HVR will automatically resume where it was interrupted.
There are two runtime situations where replication can stop that require additional attention: full disk volumes and a stopped HVR Scheduler daemon.
Mitigating and Recovering from Full Disk Volumes
If the HVR_CONFIG directory has no space to temporarily queue compressed data, capture processes will time out and enter retry mode. This effectively stops replication and can occur when capacity planning does not properly account for:
- Extended target system outages during continuous replication, or
- Longer than expected target system load times during the initial bulk load.
During these periods, change data is queued on the hub until the target is again available or the initial bulk load has been completed. Note that once the HVR integrate process begins applying the accumulated change data, the backlog of queued data will be automatically deleted from the hub. If the disk volume is full, then change data capture will stop. If capture has been stopped for long enough, then at startup it may be necessary to rewind into the backup or archived transaction log files of the source system (from which change data capture reads changes). If those files are no longer available, then a gap in change data will occur. In this case, there are two options to ensure both source and target systems are in sync:
- Restore the required backup or archived transaction log files, or
- Perform a new bulk load into the target.
The easiest way to avoid a full disk volume is to allocate enough space in advance. The amount of allocated space is the accumulation rate multiplied by the maximum time tolerance for not applying changes to the target. HVR will not apply changes to the target if:
- Initial bulk data loading (refresh) is in progress for one or more tables in a channel,
- The target is offline, or
- The integrate process is suspended.
The bulk load period depends on the amount of data being copied, the level of parallelization, network bandwidth, the ability of the source to unload the data, and the ability of the target to load the data. Running empirical tests on a subset of data and extrapolating those metrics is the best method for providing reasonable estimates of load times. The target can be offline, or integration suspended, for maintenance activities, during a failover, or if the database rejects changes and HVR has not been configured to handle such errors.
Simply running a capture-only test and accumulating change data on the hub for an extended duration is a valid method to determine the amount of space needed to queue data for a given period of target unavailability. This allows for determining the overall ratio of source transaction volume generation to HVR capture data volume. Common disk utilities can be used to measure the volume of data accumulated on a disk. Alternatively, and for continuous use, HVR monitors the amount of transient data that are queued on the hub as statistics in the repository.
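As an illustrative calculation (the numbers are assumptions, not measurements): if capture accumulates compressed change data at roughly 2 GB per hour and the target could be unavailable for up to 24 hours, the hub needs at least 2 GB/hour x 24 hours = 48 GB of free space for HVR_CONFIG, plus headroom.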
If more disk space is required, the EBS volumes can be resized without downtime. The high-level steps are (an illustrative CLI sketch follows the list):
- Request a volume modification.
- Monitor the progress of the volume modification.
- Once the modification is complete, expand the volume's file system using OS commands.
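The commands below sketch these steps; the volume ID, target size, device, partition, and mount point are placeholders:
# Illustrative only: request and monitor the volume modification, then grow the file system.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
# Once the modification reaches the optimizing or completed state, extend the
# partition and file system from the OS (example for XFS mounted at /hvr):
sudo growpart /dev/xvdf 1
sudo xfs_growfs -d /hvr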
For more detailed steps on expanding EBS volumes, see the following articles:
Disk overflow situations can be monitored directly using custom cron shell (e.g., bash) scripts or Amazon CloudWatch alarms, as well as indirectly using native HVR latency alerts. Using cron to schedule shell script execution is the easiest way to detect the need for, and then dynamically allocate, additional disk space. CloudWatch alarms require the CloudWatch agent. HVR can send latency alerts via Amazon SNS, SNMP, email, and Slack when the delay between source and target data exceeds a user-defined time threshold. Administrators can then take corrective action, such as allocating additional target resources or adding disk space to the EBS volume attached to the EC2 instance running the HVR hub.
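As a simple illustration of the cron-based approach (the threshold, mount point, and notification command are assumptions), a script such as the following could run every few minutes:
#!/bin/bash
# Illustrative disk-space check for the HVR hub volume; threshold and paths are placeholders.
MOUNT_POINT=/hvr
THRESHOLD=90
USED=$(df --output=pcent "$MOUNT_POINT" | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  # Notify an operator (or trigger an automated resize) when usage crosses the threshold.
  echo "HVR hub volume at ${USED}% on $(hostname)" | mail -s "HVR disk space warning" ops@example.com
fi
It could be scheduled with a crontab entry such as */5 * * * * /hvr/scripts/check_hvr_disk.sh (the path is hypothetical).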
Amazon CloudWatch alarms can be set to fire as the EBS disk becomes full. Thresholds of greater than 90% used or less than 20 GB free are reasonable settings in many situations. CloudWatch disk monitoring requires running the CloudWatch agent on the EC2 instance and collecting the disk_used_percent and disk_free metrics, respectively.
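A hedged sketch of such an alarm using the AWS CLI follows; the instance ID, dimensions, and SNS topic are placeholders, and the dimensions must match the CloudWatch agent configuration on the hub:
# Illustrative only: alarm when the hub volume exceeds 90% used.
aws cloudwatch put-metric-alarm \
  --alarm-name hvr-hub-disk-used \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 Name=path,Value=/hvr Name=device,Value=xvdf Name=fstype,Value=xfs \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:hvr-alerts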
Read more about setting up the Amazon CloudWatch Agent:
The CloudWatch alarm cannot directly invoke the resizing of an EBS volume. Instead, the EBS volume can be resized manually by an administrator when deemed necessary, or automatically using the AWS Systems Manager (SSM) Run Command or a Lambda function triggered through a CloudWatch event.
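For example, an automation could invoke a resize script on the hub instance through SSM Run Command; the instance ID and script path below are placeholders:
# Illustrative only: run a resize script on the hub instance via SSM Run Command.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-0123456789abcdef0 \
  --parameters 'commands=["/hvr/scripts/expand_hvr_volume.sh"]'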
For more detail on how to send a shell command to an EC2 instance see the following articles:
For examples of how to make Lambda functions execute a command on an EC2 instance see the following articles:
If the HVR_CONFIG directory becomes completely unrecoverable, then an older EBS snapshot will need to be recovered and attached. After that, replication will need to be re-initialized.
Recovery from Stopped Local Data Processing Scheduler Daemon/Service
Stopping the HVR Scheduler daemon (or service on Windows) will stop all replication processes. The HVR Scheduler can stop if an administrator issues a stop command, if the process crashes, or if it experiences a broken connection to the database repository due to a network timeout or a repository failover. Once the HVR Scheduler process is restarted, it will automatically start all replication processes that were in a Running state before the Scheduler stopped.
To ensure that the HVR Scheduler automatically starts if it stops for one of the above reasons, it should be run as a systemd process on Linux or as a service on Windows. The following steps show how to automatically restart the process on Linux and Windows.
For Linux:
- Locate the service file for the HVR systemd process.
- Add the following lines to the [Service] section:
  Restart=always
  RestartSec=3
- After saving the file, reload the systemd configuration:
  systemctl daemon-reload
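Alternatively, the same restart policy can be kept out of the packaged unit file by using a systemd drop-in override; the unit name below is a placeholder for the actual HVR Scheduler unit on your system:
# /etc/systemd/system/hvrscheduler.service.d/override.conf  (unit name is a placeholder)
[Service]
Restart=always
RestartSec=3
After creating the drop-in, run systemctl daemon-reload and restart the service for the change to take effect.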
For Windows:
- Locate the HVR Scheduler service in the Windows Services dialog (services.msc).
- Navigate to the Recovery tab and choose Restart the Service for the First, Second, and Subsequent failures.
- Keep the defaults for Reset fail count after and Restart service after at 0 (zero) days and 1 minute, respectively. Windows does not support restart intervals of less than one minute.
Figure 5. Windows services Recovery tab settings for the HVR Scheduler
Same Availability Zone Recovery
The same prescriptions for handling process and disk recovery also apply to failovers of the EC2 instance hosting the HVR hub server within the same availability zone. In addition, those prescriptions are extended to include Amazon CloudWatch, Amazon EBS, Amazon Elastic IP, and Amazon VPC.
The following recovery scenario is simple to set up but will incur some downtime while a new EC2 instance boots up and HVR processes perform automatic recovery. For a prescription that uses Active-Passive failover where both HVR hub instances are "warm" (running), see the following section Recovery From Availability Zone and Region Outages.
An HVR hub server failover can be initiated by CloudWatch after failing a system status check (StatusCheckFailed_System).
System status checks can register failures for the following reasons:
- There is a loss of network connectivity
- The system lost power
- Software issues cause the host to be unreachable
- Hardware issues cause the host to be unreachable
This will trigger a CloudWatch alarm action, which can be set to reboot or recover the instance. Instance recovery is preferred as it will allocate new CPU and RAM hardware in the same availability zone. The recovered instance will be identical to the primary instance, including the instance ID, private IP addresses, Elastic IP addresses, and all metadata.
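A hedged AWS CLI sketch of such an alarm with the recover action follows; the instance ID and Region are placeholders:
# Illustrative only: recover the hub instance when the system status check fails.
aws cloudwatch put-metric-alarm \
  --alarm-name hvr-hub-system-check \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover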
The sequence of events for recovery is:
- The Amazon EBS volumes are disconnected from the primary instance and attached to the recovery instance.
- The Elastic IP service remaps the IP address from the primary instance to the failover instance.
- The recovery instance is started.
- All replication services start automatically and pick up where they left off.
For additional details and considerations, including race conditions, for automatically recovering EC2 instances see the following articles:
Figure 6 shows the recovery of an HVR hub in the same Availability Zone using Amazon CloudWatch to initiate the recovery. The recovery EC2 instance uses the same EBS volume that was attached to the original instance.
Figure 6. Same Availability Zone recovery.
Recovery From Availability Zone and Region Outages
Although it is unlikely that an entire Amazon Availability Zone or Region will experience an outage, there are several ways to minimize the impact of such failures. The following prescription is for tolerating the least amount of downtime while isolating the HVR hub's data pipelines from the Internet using a VPC.
In this setup, the HVR_CONFIG, HVR_HOME, and HVR_TMP directories exist on an EFS file system that is mounted using NFSv4. Because the Amazon Route 53 health checkers are publicly hosted and can only monitor hosts with IP addresses that are publicly routable on the Internet, the Route 53 health check record is instead associated with a private hosted zone, invoking a failover when the primary record is unhealthy. The subnet for the monitored EC2 resource must be private and have access to the Internet.
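A minimal mount sketch for the shared file system, assuming a hypothetical file-system ID and the standard NFSv4.1 mount options:
# Illustrative only: mount the shared EFS file system that holds the HVR directories.
# The file-system ID, Region, and mount point are placeholders.
sudo mkdir -p /hvr
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /hvr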
Figure 7 shows the flow: an Amazon CloudWatch event calls the Lambda function, which in turn sends a metric to CloudWatch. The Route 53 health check then uses the CloudWatch alarm to invoke the failover routing. Because both EC2 instances share the same file system and both are active, failover is almost instant. Any running HVR jobs will automatically be restarted by the HVR Scheduler and pick up where they left off.
Figure 7. Failover recovery between Availability Zones and Regions
Details, as well as an Amazon CloudFormation template, for creating the Amazon Route 53 health check on a VPC with a Lambda function and CloudWatch can be found in the following article: Performing Route 53 health checks on private resources in a VPC with AWS Lambda and Amazon CloudWatch.
If the HVR hub is available to the public Internet then see Active-Active and Active-Passive Failover.
HA for HVR Stateless Agents
Like most stateless services that run on Amazon EC2 instances, high availability (and scaling) can be achieved using an Amazon Elastic Load Balancer. This allows the hub to automatically connect to a different stateless agent should the agent, or the server on which it runs, become unavailable. If needed, the HVR hub will automatically take care of any lower-level data pipeline recovery actions. This auto-recovery may include situations where the capture or integrate process has transient data in memory that could temporarily spill to disk.
Figure 8 below shows an Amazon ELB distributing connections made to HVR stateless agents for high availability. While out of scope for this document, it is notable that this same method can also be used with auto-scaling groups to further scale HVR.
Figure 8. Method of failover of an HVR stateless agent using the Amazon Elastic Load Balancer
For steps on setting up an Amazon ELB, see Tutorial: Increase the Availability of Your Application on Amazon EC2, and note that the traffic will be TCP and the default port for the HVR stateless agent is 4343 (configurable). The HVR agents do not require an HVR license file to operate; license files are required only on the hub.
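As a hedged example of the load-balancer plumbing (the names, VPC ID, and ARNs are placeholders), a TCP target group and listener for port 4343 could be created as follows:
# Illustrative only: a TCP target group and listener for HVR agents on port 4343.
aws elbv2 create-target-group \
  --name hvr-agents \
  --protocol TCP \
  --port 4343 \
  --vpc-id vpc-0123456789abcdef0
aws elbv2 create-listener \
  --load-balancer-arn <load-balancer-arn> \
  --protocol TCP \
  --port 4343 \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>
The EC2 instances running the agents would then be registered as targets of the hvr-agents group, and the hub would connect to the load balancer's DNS name instead of an individual agent.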
Conclusion
While both reliable and fault-tolerant, the high-performance HVR replication and data validation solution can be combined with AWS services to reach the next level of high availability. The HVR AMI available in the AWS Marketplace, which can be used as an HVR hub and as an HVR stateless agent, is a quick method to deploy replication services on EC2 instances. Additional Amazon services such as RDS, EBS, and EFS are the foundation for simplifying the HVR hub's high availability. Used in conjunction with Amazon Lambda and Route 53, the HVR hub can fail over even faster, as well as across Availability Zones and Regions. The Amazon ELB helps ensure the stateless agents are always up and running at scale. Combined with Amazon CloudWatch and native HVR alerting, your data replication pipelines can deliver mission-critical data with minimal to no interruptions.