GitHub Enterprise includes support for a high availability mode of operation designed to minimize service disruption in the event of hardware failure or major network outage affecting the primary instance.

In this configuration, a fully redundant secondary GitHub Enterprise instance is kept in sync with the primary instance via replication of all major datastores. If service is disrupted, the replica instance may be activated and failed over to within five minutes. The failover process must be initiated manually.

Notes:

  • Currently, GitHub Enterprise supports a single replica instance.
  • The time required to fail over will depend on how long it takes to promote the replica to primary status and redirect traffic to it.

Overview

  • Fully redundant GitHub Enterprise instance.
  • Automated setup of one-way, asynchronous replication of all datastores (Git repositories, MySQL, Redis, and Elasticsearch) from primary to replica instance.
  • Active/Passive configuration. The replica instance runs as a warm standby with database services running in replication mode but application services stopped.
  • Failover requires promoting the replica to primary status by disabling replication services and bringing application services online.
  • DNS failover.
  • Expected failover time: 2 - 10 minutes.

Targeted failure scenarios

The replication and failover features included in GitHub Enterprise 2.0 should only be used for:

  • Software crashes, either due to operating system failure or unrecoverable applications.
  • Primary system hardware failures, including storage hardware, CPU, RAM, network interfaces, etc.
  • Virtualization host system failures, including unplanned and scheduled maintenance events on AWS.
  • Logically or physically severed network at the primary site, assuming the replica instance is segregated from the primary at a network level not impacted by the failure.

The replication features are not suitable for use in these scenarios:

  • Scale-out, including geo-distributed Git read mirrors. Serving application or Git requests from a replica instance is not yet supported.
  • Backups and DR. An HA replica does not replace the need for off-site backups in your disaster recovery plan. Some forms of data corruption or loss may be replicated immediately from primary to replica. To ensure safe rollback to a known good past state, you must perform regular backups with historical snapshots.
  • Zero downtime upgrades. To prevent data loss and split-brain situations in controlled promotion scenarios, place the primary instance in maintenance mode and wait for all writes to complete before promoting the replica.

Network traffic failover strategies

The replication capabilities built into GitHub Enterprise handle keeping a replica instance's datastores in sync with the primary instance and allow promoting a replica instance to primary mode. However, you must separately configure and manage redirecting network traffic from primary to replica during failover.

DNS failover

GitHub Enterprise 2.0 supports DNS failover only. This approach is simple to set up and avoids the need for additional network components. It works as well across datacenters and networks as it does within a single network.

With DNS failover, you should use short TTL values in the DNS records that point to the primary GitHub Enterprise instance. We recommend a TTL between 60 seconds and five minutes.

During failover, you must place the primary into maintenance mode (if available) and redirect its DNS records to the replica instance's IP address. The time needed to redirect traffic from primary to replica will depend on the TTL configuration and time required to update the DNS records.
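
As a rough check of the TTL recommendation above, the following sketch classifies a TTL value against the 60-second to five-minute range. The check_ttl function name and the github.example.com hostname are hypothetical, and dig is a general-purpose DNS lookup tool rather than part of GitHub Enterprise. A caching resolver reports the remaining TTL rather than the configured one, so query the authoritative name server when checking the configured value.

```shell
# Hypothetical helper: classify a DNS TTL (in seconds) against the
# 60-second to five-minute range recommended above.
check_ttl() {
  ttl=$1
  if [ "$ttl" -lt 60 ]; then
    echo "below recommended range"
  elif [ "$ttl" -le 300 ]; then
    echo "within recommended range"
  else
    echo "above recommended range"
  fi
}

# Example usage (assumes dig is installed; the hostname is a placeholder):
#   ttl=$(dig +noall +answer github.example.com A | awk 'NR==1 {print $2}')
#   check_ttl "$ttl"
```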

Replica instance provisioning and SSH access

Use the GitHub Enterprise command line utilities to:

  • Replicate all major datastores.
  • Monitor replication status.
  • Perform failover.

Provision a new GitHub Enterprise 2.0 instance to act as the replica

Provision a new instance for your desired platform. The replica instance specs should mirror the primary instance's in CPU, RAM, and storage.

  • If you use AWS, we strongly recommend booting the new instance in a separate availability zone.
  • If you use VMware, provision the virtual machine under a separate ESXi host environment or virtual datacenter, if available. Ideally, all underlying hardware, software, and network components should be isolated from those of the primary instance.

Upload license and set admin password and key

Note: You may skip this step if you already have SSH access to the replica via an EC2 Key Pair.

  1. In a browser, navigate to the new replica instance's IP address and upload your GitHub Enterprise license.

  2. Set an admin password that matches the password on the primary and continue.

  3. Choose New Install when prompted to set up a New Install or Migrate from another instance.

  4. On the settings page, add an SSH key to the list of Authorized SSH keys. This key will be used to:

    • SSH to the replica instance for initial setup.
    • Monitor replication status.
    • Perform manual failover (if necessary).

No additional configuration is necessary, as all other settings will be copied from the primary instance during replica setup.

Connect to the replica instance over SSH

You should now be able to open a terminal and connect to the replica instance's IP address using the administrative port and user as follows:

   $ ssh -p 122 admin@169.254.1.2

At this point, you should have a new GitHub Enterprise instance up and running with SSH access for running the replication utilities.

Utilities for replication management

The replication system is managed via a small set of utilities executed on the replica instance.

The following examples assume you've connected to the replica instance over SSH as the admin user as shown in the previous section.

ghe-repl-setup

The ghe-repl-setup command puts a GitHub Enterprise instance into replica / warm standby mode.

  • An encrypted openvpn tunnel is configured for communication between the two instances.
  • Database services are configured for replication and started.
  • Application services are disabled. Attempts to access the replica instance over HTTP, Git, or other supported protocols will result in an "instance in replica mode" maintenance page or error message.

To enable replica mode, run the ghe-repl-setup command, providing the IP address of the primary instance as follows:

admin@169-254-1-2:~$ ghe-repl-setup 169.254.1.1
Generating rsa key pair for communication with primary GitHub instance.
The primary GitHub Enterprise instance must be configured to allow replica access.
Visit http://169.254.1.1/setup/settings and authorize the following SSH key:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzear28WOySkn/bYoZpwacsLzkWWZDGqQLOC5bEXz50moR+BvfL1sdY/3i+KZUt0WOFcImXyDXPV1LKYVpY759ObXsibryq0C7DbdMsVb7s9/DLWfB8b7qYchzJY/rE9hUzIXeEccgtF0gYNPH/oTDR+CP/e49Goys0zZ8Axso2zKpINMexzOblJ6eJRivQcp6qVZt9NtGikc/QDeIc4tHwIKQTU5DijaCv+97CivH2q0R9epS86PpeH9usA0nOn7mxmGgfJpLIvbRGhjG44vWXewrr0GkOsKrrp8OC79lkR3VKqU/i6+wk2PYqFpLFCcPXw3Ru4VHcKaSqNUJ6mKH ha-replica-ip-169-254-1-2
Run `ghe-repl-setup 169.254.1.1' once the key has been added to continue replica setup.

On first invocation, the ghe-repl-setup command generates a new RSA key pair for SSH communication with the primary instance and writes the public key to standard output. Add the key provided to the primary instance's Authorized SSH keys and run the ghe-repl-setup command again:

admin@169-254-1-2:~$ ghe-repl-setup 169.254.1.1
Verifying ssh connectivity with 169.254.1.1 ...
Connection check succeeded.
Configuring database replication against primary ...
Success: Replica mode is configured against 169.254.1.1.
To disable replica mode and undo these changes, run `ghe-repl-teardown'.
Run `ghe-repl-start' to start replicating against the newly configured primary.

The instance is in replica mode with verified connectivity between instances, but actual replication has not started.

ghe-repl-start

The ghe-repl-start command is used to turn on active replication of all datastores:

admin@169-254-1-2:~$ ghe-repl-start
Starting OpenVPN tunnel ...
Starting MySQL replication ...
Starting Redis replication ...
Starting Elasticsearch replication ...
Starting Pages replication ...
Starting Git replication ...
Success: replication is running for all services.
Use `ghe-repl-status' to monitor replication health and progress.

Once replication is started, each datastore will begin transferring data from the primary to the replica. On a new replica, this may take some time. Once the initial dataset is copied, new changes will be replicated in near real time.
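
Because the initial copy may take a while, it can be convenient to poll until replication has caught up. The sketch below is one way to do that, not a utility shipped with GitHub Enterprise: wait_for_sync is a hypothetical helper, and it assumes that ghe-repl-status exits with code 0 only when every replication channel is OK. The status command is passed as a parameter so the loop can be tried with any command.

```shell
# Hypothetical helper: poll a status command until it exits 0 or the
# attempt limit is reached. Assumes the status command exits 0 only when
# every replication channel is OK.
wait_for_sync() {
  status_cmd=${1:-ghe-repl-status}
  max_attempts=${2:-60}
  interval=${3:-30}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$status_cmd" >/dev/null 2>&1; then
      echo "in sync after $attempt check(s)"
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  echo "not in sync after $max_attempts check(s)" >&2
  return 1
}

# Example usage on the replica: up to 60 checks, 30 seconds apart.
#   wait_for_sync ghe-repl-status 60 30
```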

ghe-repl-status

The ghe-repl-status command can be used to check on the status of each datastore's replication channel.

For each datastore, the replication stream may be in an OK, WARNING or CRITICAL state.

admin@169-254-1-2:~$ ghe-repl-status
OK: mysql replication in sync
OK: redis replication is in sync
OK: elasticsearch cluster is in sync
OK: git data is in sync (10 repos, 2 wikis, 5 gists)
OK: pages data is in sync

When any replication channel is in a WARNING state, the command exits with code 1. Similarly, when any channel is in a CRITICAL state, the command exits with code 2.
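
These exit codes make the command easy to call from a monitoring script. The following is a minimal sketch: repl_state_from_exit_code is a hypothetical helper, and treating code 0 as OK and any other code as UNKNOWN is an assumption, since only codes 1 and 2 are described above.

```shell
# Map a ghe-repl-status exit code to a state name. Codes 1 and 2 are
# documented above; treating 0 as OK and anything else as UNKNOWN is an
# assumption of this sketch.
repl_state_from_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Example usage on the replica:
#   ghe-repl-status >/dev/null 2>&1; code=$?
#   echo "replication state: $(repl_state_from_exit_code "$code")"
```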

Additional details about each datastore's replication state are available via the -v and -vv options.

$ ghe-repl-status -v
OK: mysql replication in sync
  | IO running: Yes, SQL running: Yes, Delay: 0
OK: redis replication is in sync
  | master_host:169.254.1.1
  | master_port:6379
  | master_link_status:up
  | master_last_io_seconds_ago:3
  | master_sync_in_progress:0
OK: elasticsearch cluster is in sync
  | {
  |   "cluster_name" : "github-enterprise",
  |   "status" : "green",
  |   "timed_out" : false,
  |   "number_of_nodes" : 2,
  |   "number_of_data_nodes" : 2,
  |   "active_primary_shards" : 12,
  |   "active_shards" : 24,
  |   "relocating_shards" : 0,
  |   "initializing_shards" : 0,
  |   "unassigned_shards" : 0
  | }
OK: git data is in sync (366 repos, 31 wikis, 851 gists)
  |                   TOTAL         OK      FAULT    PENDING      DELAY
  | repositories        366        366          0          0        0.0
  |        wikis         31         31          0          0        0.0
  |        gists        851        851          0          0        0.0
  |        total       1248       1248          0          0        0.0
OK: pages data is in sync
  | Pages are in sync

ghe-repl-stop

The ghe-repl-stop command is used to disable replication for all datastores, either temporarily or permanently.

admin@169-254-1-2:~$ ghe-repl-stop
Stopping Pages replication ...
Stopping Git replication ...
Stopping MySQL replication ...
Stopping Redis replication ...
Stopping Elasticsearch replication ...
Stopping OpenVPN tunnel ...
Success: replication was stopped for all services.

Replication may be restarted at any time with the ghe-repl-start command and will resume from the previous replication position for all datastores.

ghe-repl-promote

The ghe-repl-promote command disables replication and converts the replica instance into a primary. The instance is configured with the same settings as the original primary and all services are enabled so the instance can start serving user reads and writes.

admin@169-254-1-2:~$ ghe-repl-promote
Enabling maintenance mode on the primary to prevent writes ...
Stopping replication ...
  | Stopping Pages replication ...
  | Stopping Git replication ...
  | Stopping MySQL replication ...
  | Stopping Redis replication ...
  | Stopping Elasticsearch replication ...
  | Stopping OpenVPN tunnel ...
  | Success: replication was stopped for all services.
Switching out of replica mode ...
  | Success: Replication configuration has been removed.
  | Run `ghe-repl-setup' to re-enable replica mode.
Applying configuration and starting services ...
Success: Replica has been promoted to primary and is now accepting requests.

Initiating a failover to your replica instance

  1. Put the primary into maintenance mode to allow replication to catch up before you switch to the replica instance.
  2. When the number of active Git operations reaches zero, wait an additional 30 seconds.
  3. On the replica instance, run ghe-repl-status -vv. Verify that all replication channels report OK.
  4. To stop replication and to tell the replica instance that it has been promoted to primary, run ghe-repl-promote.
  5. If you are using DNS, update the DNS record to point to the IP address of the replica. After the TTL period elapses, traffic will be directed to the replica, which now serves as the primary instance.
  6. Once DNS has been updated, notify users that they can resume normal operations.
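
For a planned failover, steps 3 and 4 above can be combined so that promotion only runs when the status check succeeds. The sketch below is illustrative only: promote_if_in_sync is a hypothetical helper, it assumes ghe-repl-status exits with code 0 when all channels report OK, and it takes the command names as parameters so the logic can be dry-run without touching a real instance. It is not suitable for an unplanned failover where the primary is unreachable, because the status check would not report OK in that case.

```shell
# Hypothetical helper covering steps 3 and 4: promote only when the status
# check succeeds. Command names are parameters so the logic can be dry-run.
promote_if_in_sync() {
  status_cmd=${1:-ghe-repl-status}
  promote_cmd=${2:-ghe-repl-promote}
  if "$status_cmd" -vv; then
    "$promote_cmd"
  else
    echo "Replication is not OK; refusing to promote." >&2
    return 1
  fi
}

# Example usage on the replica, after the primary is in maintenance mode:
#   promote_if_in_sync
```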

Recovering HA post-failover

After failing over, you will be relying on a single instance and will want to regain redundancy as soon as possible.

  • If the failover was planned or was not related to the health of the primary instance, you can likely use the former primary as the new replica instance.
  • If the failover was related to an issue with the primary instance, you may prefer to start with a fresh replica instance.

Make the former primary the new replica

There are a number of reasons to fail over that are not related to the health of the primary instance. If you initiate a planned failover and want to reuse your primary instance, reconfigure the former primary as the new replica instance.

  1. Make sure you have SSH access to the former primary instance.
  2. On the former primary instance (which will become the new replica), run ghe-repl-setup <ip address> with the IP address of the new primary.
  3. Once the RSA key pair is generated, add the public key printed by ghe-repl-setup to the new primary's Authorized SSH keys.
  4. To verify the connection to the new primary and enable replica mode for the new replica, run ghe-repl-setup again.
  5. To begin replication, run ghe-repl-start.

Create a brand new replica instance

If there was an issue with the primary instance that caused you to initiate a failover, you may prefer to create a new replica instance.

  1. Provision and install a new instance of GitHub Enterprise.
  2. Configure the new instance as your replica by running ghe-repl-setup and ghe-repl-start as described in the sections above.