GitHub Enterprise includes support for a high availability mode of operation designed to minimize service disruption in the event of hardware failure or major network outage affecting the primary appliance.

In this configuration, a fully redundant secondary GitHub Enterprise appliance is kept in sync with the primary appliance via replication of all major datastores. If service is disrupted, the replica appliance may be activated and failed over to within minutes. The failover process must be initiated manually.

Notes:

  • Currently, GitHub Enterprise supports a single replica appliance.
  • The time required to fail over will depend on how long it takes to promote the replica to primary status and redirect traffic to it.

Overview

  • Fully redundant GitHub Enterprise appliance.
  • Automated setup of one-way, asynchronous replication of all datastores (Git repositories, MySQL, Redis, and Elasticsearch) from primary to replica appliance.
  • Active/Passive configuration. The replica appliance runs as a warm standby with database services running in replication mode but application services stopped.
  • Failover requires promoting the replica to primary status by disabling replication services and bringing application services online.
  • DNS failover.
  • Expected failover time: 2 - 10 minutes.

Targeted failure scenarios

A High Availability configuration is suitable for protection against:

  • Software crashes, whether caused by operating system failure or by unrecoverable application errors.
  • Hardware failures, including storage hardware, CPU, RAM, network interfaces, etc.
  • Virtualization host system failures, including unplanned and scheduled maintenance events on AWS.
  • Logically or physically severed network, if the failover appliance is on a separate network not impacted by the failure.

A High Availability configuration is not suitable for use in these scenarios:

  • Scale-out, including geo-distributed Git read mirrors. Serving application or Git requests from a replica appliance is not yet supported.
  • Backups and DR. An HA replica does not replace the need for off-site backups in your disaster recovery plan. Some forms of data corruption or loss may be replicated immediately from primary to replica. To ensure safe rollback to a known good past state, you must perform regular backups with historical snapshots.
  • Zero downtime upgrades. To prevent data loss and split-brain situations in controlled promotion scenarios, place the primary appliance in maintenance mode and wait for all writes to complete before promoting the replica.

Network traffic failover strategies

The replication capabilities built into GitHub Enterprise handle keeping a replica appliance's datastores in sync with the primary appliance and allow promoting a replica appliance to primary mode. However, you must separately configure and manage redirecting network traffic from primary to replica during failover.

DNS failover

DNS failover is an approach that is simple to set up and avoids the need for additional network components. It works equally well across datacenters and networks as it does within the same network.

With DNS failover, you should use short TTL values in the DNS records that point to the primary GitHub Enterprise appliance. We recommend a TTL between 60 seconds and five minutes.

During failover, you must place the primary into maintenance mode (if available) and redirect its DNS records to the replica appliance's IP address. The time needed to redirect traffic from primary to replica will depend on the TTL configuration and time required to update the DNS records.
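
Since redirect time depends on the TTL, it can help to confirm the TTL that resolvers actually see before a failover is needed. The following is a minimal sketch, not part of GitHub Enterprise: the helper names and the github.example.com hostname are illustrative assumptions, and it relies on the standard dig answer format, where the TTL is the second field. Note that a caching resolver reports the remaining TTL rather than the configured one.

```shell
# Hypothetical helper: extract the TTL (second field) from the first
# line of `dig +noall +answer` output read on standard input.
ttl_from_answer() {
  awk 'NR == 1 { print $2 }'
}

# Hypothetical helper: succeed only if the TTL (in seconds) falls within
# the recommended range of 60 seconds to five minutes.
ttl_in_recommended_range() {
  [ "$1" -ge 60 ] && [ "$1" -le 300 ]
}

# Example usage (github.example.com is a placeholder hostname):
#   ttl=$(dig +noall +answer github.example.com A | ttl_from_answer)
#   ttl_in_recommended_range "$ttl" || echo "TTL $ttl is outside 60-300 seconds"
```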

Replica appliance provisioning and SSH access

Use the GitHub Enterprise command line utilities to:

  • Replicate all major datastores.
  • Monitor replication status.
  • Perform failover.

Provision a new GitHub Enterprise 2.6 appliance to act as the replica

Provision a new appliance for your desired platform. The replica appliance specs should mirror the primary appliance's in CPU, RAM, and storage.

  • If you use AWS, we strongly recommend booting the new appliance in a separate availability zone.
  • If you use VMware, provision the virtual machine under a separate ESXi host environment or virtual datacenter, if available. Ideally, all underlying hardware, software, and network components should be isolated from those of the primary appliance.

Upload license and set admin password and key

Note: You may skip this step if you already have SSH access to the replica via an EC2 Key Pair.

  1. In a browser, navigate to the new replica appliance's IP address and upload your GitHub Enterprise license.

  2. Set an admin password that matches the password on the primary and continue.

  3. Choose New Install when prompted to set up a New Install or Migrate from another appliance.

  4. On the settings page, add an SSH key to the list of Authorized SSH keys. This key will be used to:

    • SSH to the replica appliance for initial setup.
    • Monitor replication status.
    • Perform manual failover (if necessary).

No additional configuration is necessary, as all other settings will be copied from the primary appliance during replica setup.

Connect to the replica appliance over SSH

You should now be able to open a terminal and connect to the replica appliance's IP address using the administrative port and user as follows:

   $ ssh -p 122 admin@169.254.1.2

At this point, you should have a new GitHub Enterprise appliance up and running with SSH access for running the replication utilities.

Utilities for replication management

The replication system is managed via a small set of utilities executed on the replica appliance.

The following examples assume you've connected to the replica appliance over SSH as the admin user as shown in the previous section.

ghe-repl-setup

The ghe-repl-setup command puts a GitHub Enterprise appliance into replica / warm standby mode.

  • An encrypted OpenVPN tunnel is configured for communication between the two appliances.
  • Database services are configured for replication and started.
  • Application services are disabled. Attempts to access the replica appliance over HTTP, Git, or other supported protocols will result in an "instance in replica mode" maintenance page or error message.

To enable replica mode, run the ghe-repl-setup command, providing the IP address of the primary appliance as follows:

admin@169-254-1-2:~$ ghe-repl-setup 169.254.1.1
Generating rsa key pair for communication with primary GitHub instance.
The primary GitHub Enterprise instance must be configured to allow replica access.
Visit http://169.254.1.1/setup/settings and authorize the following SSH key:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzear28WOySkn/bYoZpwacsLzkWWZDGqQLOC5bEXz50moR+BvfL1sdY/3i+KZUt0WOFcImXyDXPV1LKYVpY759ObXsibryq0C7DbdMsVb7s9/DLWfB8b7qYchzJY/rE9hUzIXeEccgtF0gYNPH/oTDR+CP/e49Goys0zZ8Axso2zKpINMexzOblJ6eJRivQcp6qVZt9NtGikc/QDeIc4tHwIKQTU5DijaCv+97CivH2q0R9epS86PpeH9usA0nOn7mxmGgfJpLIvbRGhjG44vWXewrr0GkOsKrrp8OC79lkR3VKqU/i6+wk2PYqFpLFCcPXw3Ru4VHcKaSqNUJ6mKH ha-replica-ip-169-254-1-2
Run `ghe-repl-setup 169.254.1.1' once the key has been added to continue replica setup.

On first invocation, the ghe-repl-setup command generates a new RSA key pair for SSH communication with the primary appliance and writes the public key to standard output. Add the key provided to the primary appliance's Authorized SSH keys and run the ghe-repl-setup command again:

admin@169-254-1-2:~$ ghe-repl-setup 169.254.1.1
Verifying ssh connectivity with 169.254.1.1 ...
Connection check succeeded.
Configuring database replication against primary ...
Success: Replica mode is configured against 169.254.1.1.
To disable replica mode and undo these changes, run `ghe-repl-teardown'.
Run `ghe-repl-start' to start replicating against the newly configured primary.

The appliance is in replica mode with verified connectivity between appliances, but actual replication has not started.
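
Because the first invocation writes the public key to standard output along with other messages, the key line can be isolated before pasting it into the primary's settings page. This is a small sketch under the assumption that the key is always printed on its own line beginning with "ssh-rsa", as in the sample output above; the extract_public_key name is hypothetical.

```shell
# Hypothetical helper: print only the public key line from the output of
# the first ghe-repl-setup run, read on standard input.
# Assumes the key line starts with "ssh-rsa " as in the sample output.
extract_public_key() {
  grep '^ssh-rsa '
}

# Example usage on the replica appliance:
#   ghe-repl-setup 169.254.1.1 | extract_public_key
```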

ghe-repl-start

The ghe-repl-start command is used to turn on active replication of all datastores:

admin@169-254-1-2:~$ ghe-repl-start
Starting OpenVPN tunnel ...
Starting MySQL replication ...
Starting Redis replication ...
Starting Elasticsearch replication ...
Starting Pages replication ...
Starting Git replication ...
Success: replication is running for all services.
Use `ghe-repl-status' to monitor replication health and progress.

Once replication is started, each datastore will begin transferring data from the primary to the replica. On a new replica, this may take some time. Once the initial dataset is copied, new changes will be replicated in near real time.
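
To wait for the initial copy to finish, one option is to poll ghe-repl-status until it succeeds. The sketch below is illustrative only: wait_until_ok is a hypothetical helper, and it assumes that ghe-repl-status exits with code 0 once every channel is OK, which this guide implies but does not state outright.

```shell
# Hypothetical helper: run a command repeatedly until it exits 0.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep after each failed attempt
#   remaining arguments = the command to run
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until_ok() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example usage on the replica appliance (up to 60 checks, 30 seconds apart):
#   wait_until_ok 60 30 ghe-repl-status && echo "initial sync complete"
```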

ghe-repl-status

The ghe-repl-status command can be used to check on the status of each datastore's replication channel.

For each datastore, the replication stream may be in an OK, WARNING or CRITICAL state.

admin@169-254-1-2:~$ ghe-repl-status
OK: mysql replication in sync
OK: redis replication is in sync
OK: elasticsearch cluster is in sync
OK: git data is in sync (10 repos, 2 wikis, 5 gists)
OK: pages data is in sync

When any of the replication channels are in a WARNING state, the command will exit with the code 1. Similarly, when any of the channels are in a CRITICAL state, the command will exit with the code 2.
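
These exit codes make the command easy to wire into a monitoring check. The following sketch maps an exit code to a state label. Treating code 0 as OK and any other code as UNKNOWN are assumptions based on common convention, not something this guide specifies; the function name is hypothetical.

```shell
# Hypothetical helper: translate a ghe-repl-status exit code into a label.
# Codes 1 and 2 are documented above; 0 = OK and the UNKNOWN fallback
# are assumptions.
repl_state_from_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Example usage on the replica appliance:
#   ghe-repl-status >/dev/null 2>&1
#   repl_state_from_exit_code "$?"
```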

Additional details about each datastore's replication state are available via the -v and -vv options.

admin@169-254-1-2:~$ ghe-repl-status -v
OK: mysql replication in sync
  | IO running: Yes, SQL running: Yes, Delay: 0
OK: redis replication is in sync
  | master_host:169.254.1.1
  | master_port:6379
  | master_link_status:up
  | master_last_io_seconds_ago:3
  | master_sync_in_progress:0
OK: elasticsearch cluster is in sync
  | {
  |   "cluster_name" : "github-enterprise",
  |   "status" : "green",
  |   "timed_out" : false,
  |   "number_of_nodes" : 2,
  |   "number_of_data_nodes" : 2,
  |   "active_primary_shards" : 12,
  |   "active_shards" : 24,
  |   "relocating_shards" : 0,
  |   "initializing_shards" : 0,
  |   "unassigned_shards" : 0
  | }
OK: git data is in sync (366 repos, 31 wikis, 851 gists)
  |                   TOTAL         OK      FAULT    PENDING      DELAY
  | repositories        366        366          0          0        0.0
  |        wikis         31         31          0          0        0.0
  |        gists        851        851          0          0        0.0
  |        total       1248       1248          0          0        0.0
OK: pages data is in sync
  | Pages are in sync
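
The verbose Git table lends itself to simple scripted checks. As a rough sketch, the helper below pulls the FAULT and PENDING counts from the "total" row. It assumes the column layout shown in the sample output above (TOTAL, OK, FAULT, PENDING, DELAY, each row prefixed with "|"), which may differ between releases; the function name is hypothetical.

```shell
# Hypothetical helper: read `ghe-repl-status -v` output on standard input
# and print the FAULT and PENDING counts from the Git "total" row.
# Fields after whitespace splitting: | total TOTAL OK FAULT PENDING DELAY
git_total_fault_pending() {
  awk '$1 == "|" && $2 == "total" { print $5, $6 }'
}

# Example usage on the replica appliance:
#   ghe-repl-status -v | git_total_fault_pending
```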

ghe-repl-stop

The ghe-repl-stop command is used to disable replication for all datastores, either temporarily or permanently. This command only stops the replication services; the replica configuration is left in place.

admin@169-254-1-2:~$ ghe-repl-stop
Stopping Pages replication ...
Stopping Git replication ...
Stopping MySQL replication ...
Stopping Redis replication ...
Stopping Elasticsearch replication ...
Stopping OpenVPN tunnel ...
Success: replication was stopped for all services.

Replication may be restarted at any time with the ghe-repl-start command and will resume from the previous replication position for all datastores.

ghe-repl-promote

The ghe-repl-promote command disables replication and converts the replica appliance into a primary. The appliance is configured with the same settings as the original primary and all services are enabled so the appliance can start serving user reads and writes.

admin@169-254-1-2:~$ ghe-repl-promote
Enabling maintenance mode on the primary to prevent writes ...
Stopping replication ...
  | Stopping Pages replication ...
  | Stopping Git replication ...
  | Stopping MySQL replication ...
  | Stopping Redis replication ...
  | Stopping Elasticsearch replication ...
  | Stopping OpenVPN tunnel ...
  | Success: replication was stopped for all services.
Switching out of replica mode ...
  | Success: Replication configuration has been removed.
  | Run `ghe-repl-setup' to re-enable replica mode.
Applying configuration and starting services ...
Success: Replica has been promoted to primary and is now accepting requests.

Initiating a failover to your replica appliance

  1. Put the primary into maintenance mode to allow replication to catch up before you switch to the replica appliance.
  2. When the number of active Git operations reaches zero, wait an additional 30 seconds.
  3. On the replica appliance, run ghe-repl-status -vv. Verify that all replication channels report OK.
  4. To stop replication and to tell the replica appliance that it has been promoted to primary, run ghe-repl-promote.
  5. If you are using DNS, update the DNS record to point to the IP address of the replica. After the TTL period elapses, clients will resolve the hostname to the promoted appliance, which then serves as the primary.
  6. Once DNS has been updated, notify users that they can resume normal operations.
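
For a planned failover, steps 3 and 4 above can be combined so that promotion only happens when the status check passes. This is a minimal sketch, not an official tool: promote_if_in_sync is a hypothetical wrapper that takes the status and promote commands as arguments, and it relies on ghe-repl-status exiting non-zero when any channel is in a WARNING or CRITICAL state. In an unplanned failover the primary may be unreachable and the check will fail, so the promotion would have to be run directly.

```shell
# Hypothetical wrapper: run the status command with -vv and promote only
# if it exits 0.
#   $1 = status command  (ghe-repl-status on the replica)
#   $2 = promote command (ghe-repl-promote on the replica)
promote_if_in_sync() {
  status_cmd=$1
  promote_cmd=$2
  if "$status_cmd" -vv; then
    "$promote_cmd"
  else
    echo "replication is not fully in sync; not promoting" >&2
    return 1
  fi
}

# Example usage on the replica appliance, after the primary is in
# maintenance mode and active Git operations have drained:
#   promote_if_in_sync ghe-repl-status ghe-repl-promote
```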

Recovering HA post-failover

After failing over, you will be relying on a single appliance and will want to regain redundancy as soon as possible.

  • If the failover was planned or was not related to the health of the primary appliance, you can likely use the former primary as the new replica appliance.
  • If the failover was related to an issue with the primary appliance, you may prefer to start with a fresh replica appliance.

Make the former primary the new replica

There are a number of reasons to fail over that are not related to the health of the primary appliance. If you initiate a planned failover and want to reuse your primary appliance, reconfigure the former primary as the new replica appliance.

  1. Make sure you have SSH access to the former primary appliance.
  2. On the former primary appliance (which will become the new replica), run ghe-repl-setup <ip address> with the IP address of the new primary.
  3. Once the RSA key pair is generated, add the public key printed by ghe-repl-setup to the new primary's Authorized SSH keys.
  4. To verify the connection to the new primary and enable replica mode for the new replica, run ghe-repl-setup again.
  5. To begin replication, run ghe-repl-start.

Create a brand new replica appliance

If there was an issue with the primary appliance that caused you to initiate a failover, you may prefer to create a new replica appliance.

  1. Provision and install a new GitHub Enterprise appliance.
  2. Configure HA and make the new system into your replica appliance.

ghe-repl-teardown

The ghe-repl-teardown command disables replica mode completely and removes the replication configuration.