Replacing a cluster node

If a node fails in a GitHub Enterprise Server cluster, or if you want to add a new node with more resources, mark any nodes to replace as offline, then add the new node.

Who can use this feature?

GitHub determines eligibility for clustering, and must enable the configuration for your instance's license. Clustering requires careful planning and additional administrative overhead. For more information, see About clustering.

About replacement of GitHub Enterprise Server cluster nodes

You can replace a functional node in a GitHub Enterprise Server cluster, or you can replace a node that has failed unexpectedly.

After you replace a node, your GitHub Enterprise Server instance does not automatically distribute jobs to the new node. You can force your instance to balance jobs across nodes. For more information, see Rebalancing cluster workloads.

Warning

To avoid conflicts, do not reuse a hostname that was previously assigned to a node in the cluster.
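
As a quick pre-check, you can confirm that a candidate hostname is not already listed in cluster.conf. The following is a minimal sketch, not an official utility: the function name hostname_in_cluster_conf is hypothetical, and the default path is the cluster.conf location used later in this article. It only detects hostnames that still have a [cluster "HOSTNAME"] section, so it cannot catch hostnames of nodes whose entries were already deleted from the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if HOSTNAME already has a
# [cluster "HOSTNAME"] section in the given cluster.conf file.
hostname_in_cluster_conf() {
  local name="$1"
  local conf="${2:-/data/user/common/cluster.conf}"
  # -F treats the pattern as a fixed string, so dots in hostnames are literal.
  grep -Fq "[cluster \"${name}\"]" "$conf"
}
```

For example, hostname_in_cluster_conf ghe-replacement-data-node-3 returns a non-zero exit status when the name is not yet present in the file.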

Replacing a functional node

You can replace an existing, functional node in your cluster. For example, you may want to provide a virtual machine (VM) with additional CPU, memory, or storage resources.

To replace a functional node, install the GitHub Enterprise Server appliance on a new VM, configure an IP address, add the new node to the cluster configuration file, initialize the cluster and apply the configuration, then take the node you replaced offline.

Note

If you're replacing the primary MySQL node, see Replacing the primary MySQL node.

  1. Provision and install GitHub Enterprise Server with a unique hostname on the replacement node.

  2. Using the administrative shell or DHCP, configure only the IP address of the replacement node. Don't configure any other settings.

  3. To add the newly provisioned replacement node, on any node, modify the cluster.conf file to remove the node you're replacing and add the replacement node. For example, this modified cluster.conf file replaces ghe-data-node-3 with the newly provisioned node, ghe-replacement-data-node-3:

    [cluster "ghe-replacement-data-node-3"]
      hostname = ghe-replacement-data-node-3
      ipv4 = 192.168.0.7
      # ipv6 = fd12:3456:789a:1::7
      consul-datacenter = PRIMARY-DATACENTER
      git-server = true
      pages-server = true
      mysql-server = true
      elasticsearch-server = true
      redis-server = true
      memcache-server = true
      metrics-server = true
      storage-server = true
    

    You can choose to defer database seeding of a new MySQL replica node, which lets you open your appliance to traffic sooner. For more information, see Deferring database seeding.

  4. From the administrative shell of the node with the modified cluster.conf, run ghe-cluster-config-init. This will initialize the newly added node in the cluster.

  5. From the same node, run ghe-cluster-config-apply. This will validate the configuration file, copy it to each node in the cluster, and configure each node according to the modified cluster.conf file.

  6. To take the node you're replacing offline, from the primary MySQL node of your cluster, run the following command.

    ghe-remove-node NODE-HOSTNAME
    

    This command will evacuate data from any data services running on the node, mark the node as offline in your configuration, and stop traffic being routed to the node. For more information, see Command-line utilities.
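
Steps 4 through 6 above can be sketched as a small dry-run script. This is an illustration under stated assumptions, not an official tool: it assumes that you edited cluster.conf on the primary MySQL node, so that all three commands can run from the same administrative shell. The function name replace_functional_node and the RUN switch are hypothetical. By default the script only prints each command so that you can review the sequence.

```shell
#!/usr/bin/env bash
# Dry-run sketch of steps 4 through 6. RUN is a hypothetical switch:
# by default each command is printed; set RUN="" to execute for real.
RUN="${RUN-echo}"

replace_functional_node() {
  local old_node="$1"  # hostname of the node being replaced
  # Step 4: initialize the newly added node.
  $RUN ghe-cluster-config-init || return 1
  # Step 5: validate cluster.conf and apply it to every node.
  $RUN ghe-cluster-config-apply || return 1
  # Step 6: evacuate data services and mark the old node offline.
  $RUN ghe-remove-node "$old_node"
}
```

Calling replace_functional_node ghe-data-node-3 in dry-run mode prints the three commands in order. Each step stops the sequence if the previous command fails, so the old node is not removed when initialization or configuration does not complete.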

Replacing a node in an emergency

You can replace a failed node in your cluster. For example, a software or hardware issue may affect a node's availability.

Note

If you're replacing the primary MySQL node, see Replacing the primary MySQL node.

To replace a node in an emergency, you'll take the failed node offline, add your replacement node to the cluster, then run commands to remove references to data services on the removed node.

  1. To remove the node that is experiencing issues from the cluster, from the primary MySQL node of your cluster, run the following command. Replace NODE-HOSTNAME with the hostname of the node you're taking offline.

    ghe-remove-node --no-evacuate NODE-HOSTNAME
    

    This command will mark the node as offline in your configuration and stop traffic being routed to the node. You can run this command in no-evacuate mode now because, later in this procedure, you'll run commands that instruct data services on the node to copy any replicas onto the other available nodes in the cluster. For more information, see Command-line utilities.

  2. Add your replacement node to the cluster.

    1. Provision and install GitHub Enterprise Server with a unique hostname on the replacement node.

    2. Using the administrative shell or DHCP, configure only the IP address of the replacement node. Don't configure any other settings.

    3. To add the newly provisioned replacement node, on any node, modify the cluster.conf file to add the replacement node. For example, this modified cluster.conf file adds the newly provisioned node ghe-replacement-data-node-3:

      [cluster "ghe-replacement-data-node-3"]
        hostname = ghe-replacement-data-node-3
        ipv4 = 192.168.0.7
        # ipv6 = fd12:3456:789a:1::7
        git-server = true
        pages-server = true
        mysql-server = true
        elasticsearch-server = true
        redis-server = true
        memcache-server = true
        metrics-server = true
        storage-server = true
      
    4. From the administrative shell of the node with the modified cluster.conf, run ghe-cluster-config-init. This will initialize the newly added node in the cluster.

    5. From the same node, run ghe-cluster-config-apply. This will validate the configuration file, copy it to each node in the cluster, and configure each node according to the modified cluster.conf file.

  3. Remove references to data services on the node you removed.

    1. Find the UUID of the node you removed. To find the UUID, run the following command, replacing HOSTNAME with the hostname of the node. You will use this UUID in the next step.

      ghe-config cluster.HOSTNAME.uuid
      
    2. To remove references to data services, run the following commands. Replace UUID with the UUID of the node.

      These commands indicate to each service that the node is permanently removed. The services will recreate any replicas contained within the node on the available nodes within the cluster.

      Note

      These commands may cause increased load on the server while data is rebalanced across replicas.

      For the git-server service (used for repository data):

      ghe-spokesctl server destroy git-server-UUID
      

      For the pages-server service (used for GitHub Pages site builds):

      ghe-dpages remove pages-server-UUID
      

      For the storage-server service (used for Git LFS data, avatar images, file attachments, and release archives):

      ghe-storage destroy-host storage-server-UUID --force
      
  4. Optionally, delete the entry for the removed node in your cluster.conf file. Doing so will keep your cluster.conf file organized and save time during future config-apply runs.

    1. To remove the entry from the file, run the following command, replacing HOSTNAME with the hostname of the removed node.

      ghe-config --remove-section "cluster.HOSTNAME"
      
    2. To copy the configuration to other nodes in the cluster, from the administrative shell of the node where you modified cluster.conf, run ghe-cluster-config-apply.
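
Because the cleanup commands in step 3 are destructive, it can help to print them for review before running them by hand. The following is a minimal sketch, assuming the removed node hosted all three data services. The function name print_cleanup_commands is hypothetical, and the function only prints text without running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the step 3 cleanup commands for one node UUID
# so they can be reviewed before being run by hand.
print_cleanup_commands() {
  local uuid="${1:-}"
  if [ -z "$uuid" ]; then
    echo "usage: print_cleanup_commands UUID" >&2
    return 1
  fi
  printf 'ghe-spokesctl server destroy git-server-%s\n' "$uuid"
  printf 'ghe-dpages remove pages-server-%s\n' "$uuid"
  printf 'ghe-storage destroy-host storage-server-%s --force\n' "$uuid"
}
```

On the appliance, you could pass in the value returned by ghe-config cluster.HOSTNAME.uuid, then run only the printed commands that apply to services the removed node actually hosted.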

Replacing the primary MySQL node

To provide database services, your cluster requires a primary MySQL node and at least one replica MySQL node. For more information, see About cluster nodes.
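
To check how many nodes are configured for MySQL, you can count the node sections that set mysql-server = true. The following is a rough sketch rather than a supported command: the function name count_mysql_nodes is hypothetical, and it assumes that cluster.conf uses the [cluster "HOSTNAME"] layout shown in this article, with one key-value pair per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count [cluster "HOSTNAME"] sections that set
# mysql-server = true in a cluster.conf-style file.
count_mysql_nodes() {
  local conf="${1:-/data/user/common/cluster.conf}"
  awk '
    # On every section header, note whether it is a per-node section.
    /^[[:space:]]*\[/ { in_node = ($0 ~ /^[[:space:]]*\[cluster "/) }
    # Count uncommented mysql-server = true lines inside node sections.
    in_node && /^[[:space:]]*mysql-server[[:space:]]*=[[:space:]]*true[[:space:]]*$/ { n++ }
    END { print n + 0 }
  ' "$conf"
}
```

A result of 2 or more is consistent with having a primary MySQL node and at least one replica MySQL node.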

If you want to provide the VM for your primary MySQL node with more resources, or if the node fails, you can replace the node. To minimize downtime, add the new node to your cluster, replicate the MySQL data, and then promote the node. Some downtime is required during promotion.

  1. Provision and install GitHub Enterprise Server with a unique hostname on the replacement node.

  2. Using the administrative shell or DHCP, configure only the IP address of the replacement node. Don't configure any other settings.

  3. To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see Accessing the administrative shell (SSH).

    ssh -p 122 admin@HOSTNAME
    
  4. Open the cluster configuration file at /data/user/common/cluster.conf in a text editor. For example, you can use Vim. Create a backup of the cluster.conf file before you edit the file.

    sudo vim /data/user/common/cluster.conf
    
  5. The cluster configuration file lists each node under a [cluster "HOSTNAME"] heading. Add a new heading for the node and enter the key-value pairs for configuration, replacing the placeholders with actual values.

    • Ensure that you include the mysql-server = true key-value pair.
    • The following section is an example, and your node's configuration may differ.
    ...
    [cluster "HOSTNAME"]
      hostname = HOSTNAME
      ipv4 = IPV4-ADDRESS
      # ipv6 = IPV6-ADDRESS
      consul-datacenter = PRIMARY-DATACENTER
      datacenter = DATACENTER
      mysql-server = true
      redis-server = true
      ...
    ...
    
  6. From the administrative shell of the node with the modified cluster.conf, run ghe-cluster-config-init. This will initialize the newly added node in the cluster.

  7. From the administrative shell of the node where you modified cluster.conf, run ghe-cluster-config-apply. The newly added node will become a replica MySQL node and any other configured services will run there.

  8. Wait for MySQL replication to finish. To monitor MySQL replication from any node in the cluster, run ghe-cluster-status -v.

    Shortly after adding the node to the cluster, you may see an error for replication status while replication catches up. Replication can take hours depending on the instance's load, the amount of database data, and the last time the instance generated a database seed.

  9. During your scheduled maintenance window, enable maintenance mode. For more information, see Enabling and scheduling maintenance mode.

  10. From any node in the cluster, run ghe-cluster-status -v to confirm that MySQL replication is finished.

    Warning

    If you do not wait for MySQL replication to finish, you risk data loss on your instance.

  11. To set the current MySQL primary node to read-only mode, run the following command from one of the instance's nodes.

    echo "SET GLOBAL super_read_only = 1;" | sudo mysql
    
  12. Wait until Global Transaction Identifiers (GTIDs) set on the primary and replica MySQL nodes are identical. To check the GTIDs, run the following command from any of the instance's nodes.

    ghe-cluster-each -r mysql -- 'echo "SELECT @@global.gtid_executed;" | sudo mysql'
    
  13. After the GTIDs on the primary and replica MySQL nodes match, update the cluster configuration by opening the cluster configuration file at /data/user/common/cluster.conf in a text editor.

    • Create a backup of the cluster.conf file before you edit the file.
    • In the top-level [cluster] section, remove the hostname for the node you replaced from the mysql-master key-value pair, then assign the new node instead. If the new node is also a primary Redis node, adjust the redis-master key-value pair.
    [cluster]
      mysql-master = NEW-NODE-HOSTNAME
      redis-master = NEW-NODE-HOSTNAME
      primary-datacenter = primary
    ...
    
  14. From the administrative shell of the node where you modified cluster.conf, run ghe-cluster-config-apply. This will reconfigure the cluster so that the newly added node becomes the primary MySQL node and the original primary MySQL node becomes a replica MySQL node.

  15. Check the status of MySQL replication from any node in the cluster by running ghe-cluster-status -v.

  16. If MySQL replication is finished, from any node in the cluster, disable maintenance mode. For more information, see Enabling and scheduling maintenance mode.
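
The GTID comparison in step 12 can be sketched as a small helper that checks whether every reported value is identical. This is an illustration only: the function name gtids_match is hypothetical, and because this article does not show the exact output format of the step 12 command, the sketch assumes that you have already reduced that output to one bare gtid_executed value per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one gtid_executed value per line on stdin and
# succeed only if at least one value was read and all values are identical.
gtids_match() {
  local distinct
  # Drop blank lines, collapse duplicates, and count what remains.
  distinct="$(grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d '[:space:]')"
  [ "$distinct" -eq 1 ]
}
```

If the helper returns a non-zero exit status, the values differ or no values were read, so wait and check again before you edit the mysql-master key-value pair.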