System Resource Monitoring and Alerting

As use of your GitHub Enterprise instance increases over time, so too will the utilization of system resources, like CPU, memory, and storage. It's important to configure monitoring and alerting so that you're aware of potential issues before they become critical enough to negatively impact application performance or availability.

External monitoring and statistics collection

GitHub Enterprise includes support for monitoring basic system resources via two popular monitoring and statistics collection protocols:

SNMP - A widely supported method of monitoring network devices and servers. SNMP is disabled by default but can be enabled via the Management Console settings page at https://<hostname>/setup/settings. You will also need to make sure UDP port 161 is open and reachable from your network management station. See "Monitoring using SNMP" for more information.
collectd - An open source statistics collection and reporting daemon with built-in support for writing to RRD files. Statistics on CPU utilization, memory and disk consumption, network interface traffic and errors, and system load can be forwarded to an external collectd server where graphs, analysis, and alerting may be configured using a wide range of available tools and plugins. To enable and configure collectd forwarding, see "Configuring collectd".

Both SNMP and collectd forwarding are suitable for monitoring basic system resource use. The monitoring tools built into the underlying virtualization platform, such as Amazon CloudWatch and VMware vSphere Monitoring, may also be used for basic monitoring and alerting of system resources.

Recommended alerting thresholds

Storage

Monitoring of both the root and user storage devices should be configured with values that allow for plenty of time to respond when available disk space runs low. We recommend the following alerting thresholds as a starting point:

Severity	Threshold
Warning	Disk use exceeds 70% of total available.
Critical	Disk use exceeds 90% of total available.

It may be necessary to adjust these values based on the total amount of storage allocated, historical growth patterns, and expected time to respond. If possible, we recommend over-allocating storage resources to allow for growth over time and to avoid the maintenance downtime required to allocate additional physical storage.
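As an illustration of the storage thresholds above, here is a minimal Python sketch that classifies disk use as ok, warning, or critical. The function names and the example mount point are hypothetical, not part of GitHub Enterprise; in practice this kind of check would normally run in your external monitoring system against the values it collects.

```python
import shutil

# Starting-point thresholds from the table above (percent of total space).
WARNING_PCT = 70.0
CRITICAL_PCT = 90.0


def disk_severity(used_bytes, total_bytes,
                  warning=WARNING_PCT, critical=CRITICAL_PCT):
    """Return 'ok', 'warning', or 'critical' for a used/total pair."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100.0 * used_bytes / total_bytes
    if pct > critical:
        return "critical"
    if pct > warning:
        return "warning"
    return "ok"


def check_mount(path):
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return disk_severity(usage.used, usage.total)


if __name__ == "__main__":
    # "/" stands in for the root device; add the user storage mount
    # point used on your own instance (an assumption, not listed here).
    for mount in ("/",):
        print(mount, check_mount(mount))
```

The thresholds are parameters so they can be adjusted for the amount of storage allocated and the expected time to respond.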

CPU and load average

Alerting on CPU utilization can be tricky because resource-intensive Git operations cause normal fluctuations in CPU use. Temporary spikes are an expected pattern, but prolonged heavy CPU utilization means that the instance is likely under-provisioned. At the very least, we recommend monitoring the 15-minute system load average for values nearing or exceeding the number of CPU cores allocated to the virtual machine.

Severity	Threshold
Warning	15-minute load average exceeds 1x CPU cores.
Critical	15-minute load average exceeds 2x CPU cores.

System load average is a somewhat simplistic single metric for CPU utilization, and high values can also indicate problems with the I/O subsystem. Even so, sustained system load exceeding these values is a good indication that application performance is suffering from a lack of compute resources and that increasing the number or speed of CPU cores will improve application responsiveness.
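The load average thresholds can be expressed as a simple comparison against the core count. The following Python sketch assumes it runs on a Unix-like host where `os.getloadavg()` is available; `load_severity` is a hypothetical helper name used only for illustration.

```python
import os


def load_severity(load15, cpu_cores):
    """Classify a 15-minute load average against the number of CPU cores."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    if load15 > 2 * cpu_cores:
        return "critical"
    if load15 > cpu_cores:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
    _, _, load15 = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load15={load15:.2f} cores={cores} -> {load_severity(load15, cores)}")
```

For example, on a 4-core virtual machine a sustained 15-minute load of 4.5 would be classified as a warning, and anything above 8 as critical.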

Ideally, user, system, and nice CPU utilization should be available in graph form, both to give a clearer picture of where CPU is being consumed and so that historical utilization patterns can inform decisions about allocating additional resources.

It's also important to monitor virtualization "steal" time to ensure that other virtual machines running on the same host system are not starving the instance of compute resources.
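On Linux, steal time is reported as the eighth counter on the aggregate `cpu` line of `/proc/stat`. The sketch below shows one way to turn two samples of that file into a steal percentage. It is an illustration under the assumption that you can read `/proc/stat`-style text from the host being monitored; the function names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Return the first eight counters from the aggregate 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]
            if len(values) < 8:
                raise ValueError("cpu line has no steal field")
            return values
    raise ValueError("no aggregate cpu line found")


def steal_percent(before_text, after_text):
    """Percentage of CPU time stolen between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


if __name__ == "__main__":
    # Made-up sample data in /proc/stat format, for illustration only.
    sample_before = "cpu  100 0 50 800 10 0 0 40 0 0\n"
    sample_after = "cpu  200 0 100 1600 20 0 0 80 0 0\n"
    print(f"steal: {steal_percent(sample_before, sample_after):.1f}%")
```

The counters are cumulative since boot, which is why the calculation uses the difference between two samples rather than a single reading.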

Memory

The amount of physical memory allocated to the GitHub Enterprise instance can have a large impact on overall performance and application responsiveness. The system is designed to make heavy use of the kernel disk cache to speed up many types of Git operations. As such, we recommend that the normal resident set size (RSS) working set fit within 50% of total available RAM at peak usage.

Severity	Threshold
Warning	Sustained RSS usage exceeds 50% of total available memory.
Critical	Sustained RSS usage exceeds 70% of total available memory.

Note that the GitHub Enterprise virtual machine does not use a swap partition for low-memory conditions. If memory is exhausted, the kernel OOM killer will attempt to free memory by forcibly killing RAM-heavy application processes, which could disrupt service. For these reasons, we recommend allocating significantly more memory to the virtual machine than is required in the normal course of operations.
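The memory thresholds above can be approximated from `/proc/meminfo`-style data. The Python sketch below is an assumption-laden illustration: it treats memory that is neither free nor used for buffers and page cache as a stand-in for the RSS working set, which is only an approximation, and its function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_severity(meminfo_text, warning=50.0, critical=70.0):
    """Classify non-cache memory use against the thresholds above."""
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"]
    # Approximation: memory not free and not held as buffers or page cache.
    non_cache = (total - info["MemFree"]
                 - info.get("Buffers", 0) - info.get("Cached", 0))
    pct = 100.0 * non_cache / total
    if pct > critical:
        return "critical"
    if pct > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Made-up sample values in /proc/meminfo format, for illustration only.
    sample = (
        "MemTotal:       16000 kB\n"
        "MemFree:         2000 kB\n"
        "Buffers:         1000 kB\n"
        "Cached:          4000 kB\n"
    )
    print(memory_severity(sample))
```

Because there is no swap, alerts of this kind should be based on sustained values rather than a single sample, leaving time to add memory before the OOM killer intervenes.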
