About Node Eligibility Service
In a GitHub Enterprise Server cluster, an individual node may become unreachable by other nodes due to a hardware or software failure. If the node stays offline for long enough, then even after you restore its health, the subsequent synchronization of data can negatively impact your instance's performance.
You can proactively mitigate the impact of reduced node availability by using Node Eligibility Service. This service monitors the state of your cluster's nodes and emits a warning if a node has been offline for too long. You can also prevent an offline node from rejoining the cluster. Optionally, you can allow Node Eligibility Service to take ineligible nodes offline.
By default, Node Eligibility Service is disabled. If you enable Node Eligibility Service, your instance will alert you to unhealthy nodes by displaying a banner in the administrative web UI for GitHub Enterprise Server, and in CLI output for some cluster-related utilities, such as ghe-config-apply and ghe-cluster-diagnostics.
Node Eligibility Service allows you to monitor the health of individual nodes. You can also monitor the overall health of your cluster. For more information, see Monitoring the health of your cluster.
About health and eligibility of cluster nodes
To determine whether to emit a warning or automatically adjust the configuration of your cluster, Node Eligibility Service continuously monitors the health of each node. Each node regularly reports a timestamped health state, which Node Eligibility Service compares to a Time To Live (TTL) duration.
Each node has a health state and an eligibility state.
- Health refers to the accessibility of the node within the cluster and has three possible states: healthy, warning, or critical.
- Eligibility refers to the ability of the node to work in the cluster and has two possible states: eligible or ineligible.
Node Eligibility Service provides a configurable TTL setting for two states, warn and fail.
- warn: The node has been offline for a short period of time. This may indicate something is wrong with the node and that administrators should investigate. The default setting is 15 minutes.
- fail: The node has been offline for a long period of time, and reintroduction into the cluster could cause performance issues due to resynchronization. The default setting is 60 minutes.
For each node, Node Eligibility Service determines health and eligibility for participation in the cluster in the following ways.
- If a node has been observed to be healthy, the health state is healthy and the eligibility state is eligible.
- If a node hasn't been observed to be healthy for longer than the warn TTL, the health state is warning and the eligibility state is eligible.
- If a node hasn't been observed to be healthy for longer than the fail TTL, the health state is critical and its eligibility state is ineligible.
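The three rules above can be sketched as a small shell function. This is only an illustrative model: classify_node is a hypothetical helper, not part of the nes CLI, and treating a node exactly at a TTL boundary as not yet over it is an assumption based on the "longer than" wording. The default values of 15 and 60 minutes come from the TTL defaults described earlier.

```shell
# Hypothetical helper, not part of the nes CLI.
# Usage: classify_node MINUTES_SINCE_LAST_HEALTHY [WARN_TTL] [FAIL_TTL]
# Prints "<health state> <eligibility state>".
classify_node() {
  cn_minutes=$1
  cn_warn=${2:-15}   # default warn TTL in minutes
  cn_fail=${3:-60}   # default fail TTL in minutes

  if [ "$cn_minutes" -gt "$cn_fail" ]; then
    echo "critical ineligible"
  elif [ "$cn_minutes" -gt "$cn_warn" ]; then
    echo "warning eligible"
  else
    echo "healthy eligible"
  fi
}

classify_node 5    # well within the warn TTL
classify_node 20   # past warn, within fail
classify_node 90   # past the fail TTL
```

Under this model, a node only becomes ineligible once it has exceeded the fail TTL; exceeding the warn TTL changes the health state but not the eligibility state.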
Enabling Node Eligibility Service for your cluster
By default, Node Eligibility Service is disabled. You can enable Node Eligibility Service by setting the value for app.nes.enabled using ghe-config.
- To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see "Accessing the administrative shell (SSH)."
  ssh -p 122 admin@HOSTNAME
- To verify whether Node Eligibility Service is currently enabled, run the following command.
  ghe-config app.nes.enabled
- To enable Node Eligibility Service, run the following command.
  ghe-config app.nes.enabled true
- To apply the configuration, run the following command.
  Note: During a configuration run, services on your GitHub Enterprise Server instance may restart, which can cause brief downtime for users.
  ghe-config-apply
- Wait for the configuration run to complete.
- To verify that Node Eligibility Service is running, from any node, run the following command.
  nomad status nes
Configuring TTL settings for Node Eligibility Service
To determine how Node Eligibility Service notifies you, you can configure TTL settings for fail and warn states. The TTL for the fail state must be higher than the TTL for the warn state.
- To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see "Accessing the administrative shell (SSH)."
  ssh -p 122 admin@HOSTNAME
- To verify the current TTL settings, run the following command.
  nes get-node-ttl all
- To set the TTL for the fail state, run the following command. Replace MINUTES with the number of minutes to use for failures.
  nes set-node-ttl fail MINUTES
- To set the TTL for the warn state, run the following command. Replace MINUTES with the number of minutes to use for warnings.
  nes set-node-ttl warn MINUTES
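Because the fail TTL must be higher than the warn TTL, it may help to check a pair of values before setting them. The following is a minimal sketch of such a check. validate_ttls is a hypothetical helper, not part of the nes CLI, and the example values of 30 and 120 minutes are arbitrary. The sketch only prints the nes commands rather than running them.

```shell
# Hypothetical pre-flight check, not part of the nes CLI.
# Usage: validate_ttls WARN_MINUTES FAIL_MINUTES
# Succeeds only if both values are whole numbers and fail is higher than warn.
validate_ttls() {
  for vt_value in "$1" "$2"; do
    case "$vt_value" in
      ''|*[!0-9]*)
        echo "TTL values must be whole numbers of minutes" >&2
        return 1
        ;;
    esac
  done
  if [ "$2" -le "$1" ]; then
    echo "fail TTL ($2) must be higher than warn TTL ($1)" >&2
    return 1
  fi
  return 0
}

# Example values chosen for illustration only.
if validate_ttls 30 120; then
  # Print the commands from the steps above instead of running them.
  echo "nes set-node-ttl fail 120"
  echo "nes set-node-ttl warn 30"
fi
```

Whether the nes tool itself rejects an invalid pair is not stated here, so a check like this is best treated as a convenience rather than a substitute for verifying the result with nes get-node-ttl all.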
Managing whether Node Eligibility Service can take a node offline
By default, Node Eligibility Service provides alerts to notify you about changes to the health of cluster nodes. Optionally, if the service determines that an unhealthy node is ineligible to rejoin the cluster, you can allow the service to take the node offline.
When a node is taken offline, the instance removes job allocations from the node. If the node runs data storage services, Node Eligibility Service updates the configuration to reflect the node's ineligibility to rejoin the cluster.
To manage whether Node Eligibility Service can take a node and its services offline, you can configure adminaction states for the node. If a node is in the approved state, Node Eligibility Service can take the node offline. If a node is in the none state, Node Eligibility Service cannot take the node offline.
- To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see "Accessing the administrative shell (SSH)."
  ssh -p 122 admin@HOSTNAME
- To configure whether Node Eligibility Service can take a node offline, run one of the following commands.
  - To allow the service to automatically take administrative action when a node goes offline, run the following command. Replace HOSTNAME with the node's hostname.
    nes set-node-adminaction approved HOSTNAME
  - To revoke Node Eligibility Service's ability to take a node offline, run the following command. Replace HOSTNAME with the node's hostname.
    nes set-node-adminaction none HOSTNAME
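The relationship between eligibility and the adminaction state can be summarized in a short sketch. This is a simplified model of the behavior described in this section, not the service's actual logic, and may_take_offline is a hypothetical function name.

```shell
# Hypothetical model, not part of the nes CLI.
# Usage: may_take_offline ELIGIBILITY ADMINACTION
# Succeeds only for an ineligible node whose adminaction state is approved.
may_take_offline() {
  [ "$1" = "ineligible" ] && [ "$2" = "approved" ]
}

for state in "eligible approved" "ineligible none" "ineligible approved"; do
  # $state is intentionally unquoted so the pair splits into two arguments.
  if may_take_offline $state; then
    echo "$state: the service may take the node offline"
  else
    echo "$state: the service only alerts"
  fi
done
```

In this model, approving a node has no effect while the node remains eligible; the approval only matters once the node has exceeded the fail TTL.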
Viewing an overview of node health
To view an overview of your nodes' health using Node Eligibility Service, use one of the following methods.
- SSH into any node in the cluster, then run nes get-cluster-health.
- Navigate to the Management Console's "Status" page. For more information, see Accessing the Management Console.
Re-enabling an ineligible node to join the cluster
After Node Eligibility Service detects that a node has exceeded the TTL for the fail state, and after the service marks the node as ineligible, the service will no longer update the health status for the node. To re-enable a node to join the cluster, you can remove the ineligible status from the node.
- To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see "Accessing the administrative shell (SSH)."
  ssh -p 122 admin@HOSTNAME
- To check the current adminaction state for the node, run the following command. Replace HOSTNAME with the hostname of the ineligible node.
  nes get-node-adminaction HOSTNAME
- If the adminaction state is currently set to approved, change the state to none by running the following command. Replace HOSTNAME with the hostname of the ineligible node.
  nes set-node-adminaction none HOSTNAME
- To ensure the node is in a healthy state, run the following command and confirm that the node's status is ready.
  nomad node status
  - If the node's status is ineligible, make the node eligible by connecting to the node via SSH and running the following command.
    nomad node eligibility -enable -self
- To update the node's eligibility in Node Eligibility Service, run the following command. Replace HOSTNAME with the node's hostname.
  nes set-node-eligibility eligible HOSTNAME
- Wait 30 seconds, then check the cluster's health to confirm the target node is eligible by running the following command.
  nes get-cluster-health
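The conditional steps above can be sketched as a dry-run helper that prints which commands apply for a given node. This is only an illustration: reenable_plan is a hypothetical function, node3 is a placeholder hostname, and the adminaction and Nomad states are passed in by hand rather than read from the cluster. Nothing is executed against the instance.

```shell
# Hypothetical dry-run helper: prints the commands from the steps above
# instead of running them.
# Usage: reenable_plan HOSTNAME ADMINACTION_STATE NOMAD_STATE
reenable_plan() {
  rp_host=$1
  rp_adminaction=$2
  rp_nomad=$3

  # Only needed when the service is still allowed to take the node offline.
  if [ "$rp_adminaction" = "approved" ]; then
    echo "nes set-node-adminaction none $rp_host"
  fi
  # Only needed when Nomad reports the node as ineligible; run on the node itself.
  if [ "$rp_nomad" = "ineligible" ]; then
    echo "(on $rp_host) nomad node eligibility -enable -self"
  fi
  echo "nes set-node-eligibility eligible $rp_host"
  echo "sleep 30"
  echo "nes get-cluster-health"
}

# node3 is a placeholder hostname.
reenable_plan node3 approved ineligible
```

For a node whose adminaction state is already none and whose Nomad status is ready, the plan reduces to the final three lines.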
Viewing logs for Node Eligibility Service
You can view logs for Node Eligibility Service from any node in the cluster, or from the node that runs the service. If you generate a support bundle, the logs are included. For more information, see Providing data to GitHub Support.
- To connect to your GitHub Enterprise Server instance, SSH into any of your cluster's nodes. From your workstation, run the following command. Replace HOSTNAME with the node's hostname. For more information, see "Accessing the administrative shell (SSH)."
  ssh -p 122 admin@HOSTNAME
- To view logs for Node Eligibility Service from any node in the cluster, run the following command.
  nomad alloc logs -job nes
- Alternatively, you can view logs for Node Eligibility Service on the node that runs the service. The service writes logs to the systemd journal.
  - To determine which node runs Node Eligibility Service, run the following command.
    nomad job status "nes" | grep running | grep "${nomad_node_id}" | awk 'NR==2{ print $1 }' | xargs nomad alloc status | grep "Node Name"
  - To view logs on the node, connect to the node via SSH, then run the following command.
    journalctl -t nes