System resource monitoring and alerting

As use of your GitHub Enterprise instance increases over time, so too will the utilization of system resources, like CPU, memory, and storage. It's important to configure monitoring and alerting so that you're aware of potential issues before they become critical enough to negatively impact application performance or availability.

GitHub Enterprise includes a web-based monitoring dashboard that displays historical data about your instance, such as:

General system health such as CPU and memory usage
Application throughput, response times, and error rates
Storage usage, throughput, and latency
Background jobs and queue length
Network throughput, number of clients, and error rates
Database usage and throughput
Active web workers and queued web workers
Redis and Elasticsearch usage
Total authentications
Rate of authentication attempts
LDAP authentications and response times
LDAP sync number and rate for both existing and new users and teams
LDAP authentication timeout data

Accessing the internal monitoring dashboard

In the upper-right corner of any page, click the Site admin icon (a rocket).
In the left sidebar, click Management Console.
At the top of the page, click Monitor.

Troubleshooting common scenarios

Scenario	Possible cause(s)	Recommendations
High CPU usage	VM contention from other services or programs running on the same host.	If possible, reconfigure other services or programs to use fewer CPU resources. To increase total CPU resources for the VM, see "Increasing CPU or Memory Resources".
High memory usage	VM contention from other services or programs running on the same host.	If possible, reconfigure other services or programs to use less memory. To increase the total memory available on the VM, follow the steps for your platform in "Increasing CPU or Memory Resources".
Low disk space availability	Large binaries or log files consuming disk space.	If possible, host large binaries on a separate server, and compress or archive log files. If necessary, increase disk space on the VM by following the steps for your platform in "Increasing storage capacity".
Higher than usual response times	Often caused by one of the above issues.	Identify and fix the underlying issues. If response times remain high, contact GitHub Enterprise Support.
Elevated error rates	Software issues.	Contact GitHub Enterprise Support and include your Support Bundle.

Note: Regularly polling your GitHub Enterprise instance with continuous integration (CI) or build servers can effectively cause a denial-of-service attack that results in one or more of the above problems. We recommend using webhooks to push updates instead. For more information, see "About webhooks".
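As a rough illustration of the webhook approach, the sketch below shows how a CI server might decide whether to start a build from a delivered push event rather than polling. It assumes the webhook was created with a shared secret and that the delivery carries an HMAC-SHA1 signature in the `X-Hub-Signature` header. The function names and the `refs/heads/master` branch are hypothetical, chosen only for this example.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, header_value):
    """Check the X-Hub-Signature header ("sha1=<hex digest>") against the raw body."""
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # compare_digest avoids leaking timing information about the expected value.
    return hmac.compare_digest(expected, header_value or "")


def should_trigger_build(event, body, branch_ref="refs/heads/master"):
    """Return True only for push events on the (hypothetical) branch we build."""
    if event != "push":
        return False
    payload = json.loads(body)
    return payload.get("ref") == branch_ref
```

A receiver would call `signature_is_valid` with the raw request body before parsing it, then pass the value of the `X-GitHub-Event` header to `should_trigger_build`.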

External monitoring and statistics collection

GitHub Enterprise includes support for monitoring basic system resources via two popular monitoring and statistics collection protocols:

SNMP - A widely supported method of monitoring network devices and servers. SNMP is disabled by default but can be enabled via the Management Console settings page at https://<hostname>/setup/settings. You will also need to make sure UDP port 161 is open and reachable from your network management station. See "Monitoring using SNMP" for more information.
collectd - An open source statistics collection and reporting daemon with built-in support for writing to RRD files. Statistics on CPU utilization, memory and disk consumption, network interface traffic and errors, and system load can be forwarded to an external collectd server where graphs, analysis, and alerting may be configured using a wide range of available tools and plugins. To enable and configure collectd forwarding, see "Configuring collectd".

Both SNMP and collectd forwarding are suitable for monitoring basic system resource use. The monitoring tools built into the underlying virtualization platform, such as Amazon CloudWatch and VMware vSphere Monitoring, can also be used for basic monitoring and alerting of system resources.

Recommended alerting thresholds

Storage

Monitoring of both the root and user storage devices should be configured with values that allow for plenty of time to respond when available disk space runs low. We recommend the following alerting thresholds as a starting point:

Severity	Threshold
Warning	Disk use exceeds 70% of total available.
Critical	Disk use exceeds 90% of total available.

It may be necessary to adjust these values based on the total amount of storage allocated, historical growth patterns, and expected time to respond. If possible, we recommend over-allocating storage resources to allow for growth over time and to prevent the need for maintenance/downtime required to allocate additional physical storage.
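The storage thresholds above can be expressed as a small check. This is a minimal sketch, not part of GitHub Enterprise: the function names are hypothetical, and in practice the values would usually be evaluated by your monitoring system from SNMP or collectd data rather than by a script.

```python
import shutil

# Starting-point thresholds from the table above (percent of total disk).
WARNING_PCT = 70.0
CRITICAL_PCT = 90.0


def disk_severity(used_bytes, total_bytes):
    """Classify disk use as 'ok', 'warning', or 'critical'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct > CRITICAL_PCT:
        return "critical"
    if used_pct > WARNING_PCT:
        return "warning"
    return "ok"


def check_mount(path):
    """Classify the filesystem that contains `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return disk_severity(usage.used, usage.total)
```

Calling `check_mount` once for the root device and once for the user storage device covers both devices mentioned above. The mount points to pass in depend on your environment.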

CPU and load average

Alerting on CPU utilization can be tricky because resource-intensive Git operations cause normal fluctuations in CPU use. Temporary spikes are expected, but prolonged heavy CPU utilization means that the instance is likely under-provisioned. At the very least, we recommend monitoring the 15-minute system load average for values nearing or exceeding the number of CPU cores allocated to the virtual machine.

Severity	Threshold
Warning	15-minute load average exceeds 1x CPU cores.
Critical	15-minute load average exceeds 2x CPU cores.

For example, a virtual machine with 4 vCPUs would reflect these thresholds:

Severity	Threshold
Warning	15-minute load average exceeds 4
Critical	15-minute load average exceeds 8

Using system load average as a single metric for CPU utilization is somewhat simplistic, because a high load average can also indicate issues with the I/O subsystem. Even so, sustained system load exceeding these values is a good indication that application performance is suffering from a lack of compute resources and that increasing the number or speed of CPU cores will improve application responsiveness.
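The load average thresholds scale with the number of cores, which the following sketch makes explicit. The function names are hypothetical. `os.getloadavg()` is only available on Unix-like systems, and the script is assumed to run on the machine being measured.

```python
import os


def load_severity(load15, cpu_cores):
    """Classify a 15-minute load average against 1x and 2x the core count."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    if load15 > 2 * cpu_cores:
        return "critical"
    if load15 > cpu_cores:
        return "warning"
    return "ok"


def current_load_severity():
    """Classify the local machine's current 15-minute load average (Unix only)."""
    _, _, load15 = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_severity(load15, cores)
```

For the 4 vCPU example above, `load_severity(4.1, 4)` falls in the warning range and `load_severity(8.5, 4)` in the critical range.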

Ideally, user, system, and nice CPU utilization should also be available in graph form, both to give a clearer picture of where CPU time is being consumed and so that historical utilization patterns can inform decisions about allocating additional resources.

It's also important to monitor virtualization "steal" time to ensure that other virtual machines running on the same host system are not starving the instance of compute resources.
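On Linux, steal time is reported as the eighth counter on the aggregate `cpu` line of `/proc/stat`. The sketch below shows one way to turn two samples of that line into a steal percentage. It is an illustration only: the function names are hypothetical, and most monitoring agents already expose this value directly.

```python
def parse_cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of integer counters."""
    fields = stat_line.split()
    if len(fields) < 9 or fields[0] != "cpu":
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Counter order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight counters are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_times(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_times(stat_file.readline())
```

Taking two `read_cpu_times()` samples a few seconds apart and passing them to `steal_percent` gives the share of time the virtual machine was ready to run but waiting on the host.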

Memory

The amount of physical memory allocated to your GitHub Enterprise instance can have a large impact on overall performance and application responsiveness. The system is designed to make heavy use of the kernel disk cache to speed up many types of Git operations. As such, we recommend that the normal RSS working set fit within 50% of total available RAM at peak usage.

Severity	Threshold
Warning	Sustained RSS usage exceeds 50% of total available memory.
Critical	Sustained RSS usage exceeds 70% of total available memory.

Note also that your GitHub Enterprise instance does not use a swap partition for low-memory conditions. If memory is exhausted, the kernel OOM killer will attempt to free memory by forcibly killing memory-intensive application processes, which could disrupt service. For these reasons, we recommend allocating significantly more memory to the virtual machine than is required in the normal course of operations.
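The memory thresholds can be approximated from `/proc/meminfo`-style data, as in the sketch below. It assumes that memory not counted as free, buffers, or page cache is a reasonable stand-in for the RSS working set, which is an approximation rather than an exact RSS measurement. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_percent(meminfo):
    """Approximate working-set size as a percentage of total memory.

    Assumption: free memory, buffers, and page cache are excluded, because
    the disk cache is expected to occupy otherwise unused memory.
    """
    total = meminfo["MemTotal"]
    not_in_use = (meminfo["MemFree"]
                  + meminfo.get("Buffers", 0)
                  + meminfo.get("Cached", 0))
    return 100.0 * (total - not_in_use) / total


def memory_severity(used_pct):
    """Classify usage against the 50% warning and 70% critical thresholds."""
    if used_pct > 70.0:
        return "critical"
    if used_pct > 50.0:
        return "warning"
    return "ok"
```

Because the thresholds refer to sustained usage, an alert rule would normally require several consecutive samples above the threshold before firing, rather than reacting to a single reading.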
