Monitoring is the practice of checking whether an application is working properly. It is essential to know when the application is fully down, but it is also worth checking whether the application is performing as desired. This can range from collecting system performance metrics to collecting and analyzing business metrics. The border between system monitoring and business analytics is therefore fluid, and business analytics often feed into alerting for system administrators.

Periodic Checks

When thinking of monitoring, the first monitoring system most system administrators implement is uptime monitoring. This type of monitoring checks whether, for example, the front page of the application can be loaded or a database system can be accessed.

This kind of monitoring helps pinpoint which system is having problems. However, it is often not enough to fully monitor a system, since it is hard to build an early warning system on a single check. This is where metrics collection comes in handy.
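A periodic check can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring product works: the names `CheckResult`, `http_probe`, `tcp_probe` and `run_check` are made up for this example, and a real system would run the checks on a schedule and pass the results on to alerting.

```python
import socket
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def http_probe(url: str, timeout: float = 5.0) -> None:
    """Fetch a URL; urlopen raises on connection errors and on 4xx/5xx responses."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> None:
    """Open and close a TCP connection, e.g. to a database port."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def run_check(name: str, probe: Callable[[], None]) -> CheckResult:
    """Run one probe and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:  # any failure counts as "down" for this check
        return CheckResult(name, False, time.monotonic() - start, str(exc))
    return CheckResult(name, True, time.monotonic() - start)
```

A scheduler could then call, for instance, `run_check("front page", lambda: http_probe("https://example.com/"))` once a minute, where the URL is a placeholder. The result only says "up" or "down" at that moment, which is why such checks make a poor early warning system on their own.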

Functionality Monitoring

An extended version of uptime monitoring is functionality monitoring. This monitoring is also a periodic check, but instead of checking a single component of the system, it simulates a real user interaction using automation tools like Selenium. For example, a check could simulate a user signup and verify that it still succeeds. This kind of monitoring is useful, but it can often lead to false alarms and is expensive to maintain.
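The structure of such a check can be sketched as an ordered list of named steps. The sketch below is an assumption-laden illustration: the steps are plain Python callables, and `run_scenario` and `ScenarioResult` are hypothetical names. With a browser automation tool such as Selenium, each step would wrap the corresponding browser actions, which are omitted here. Retrying the whole scenario before reporting a failure is one common way to reduce false alarms from a single slow page load.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus an action that raises on failure.
Step = Tuple[str, Callable[[], None]]


@dataclass
class ScenarioResult:
    ok: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None
    attempts: int = 0


def _run_once(steps: List[Step]) -> ScenarioResult:
    """Run the steps in order and stop at the first one that raises."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return ScenarioResult(False, name, str(exc))
    return ScenarioResult(True)


def run_scenario(steps: List[Step], max_attempts: int = 2) -> ScenarioResult:
    """Retry the whole scenario up to max_attempts times before giving up."""
    result = ScenarioResult(ok=False)
    for attempt in range(1, max_attempts + 1):
        result = _run_once(steps)
        result.attempts = attempt
        if result.ok:
            break
    return result
```

A signup check might pass steps such as "open signup page", "submit form" and "see confirmation". When the scenario fails, `failed_step` tells the administrator which part of the user journey broke.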

Metrics Collection

When it comes to early warning and over-time monitoring, metrics systems are incredibly useful. These systems collect various metrics, ranging from CPU usage to customer signups in the last minute, and store them in a so-called time-series database. These metrics can then be aggregated, overlaid on past data, used to predict trends, and so on. Often these metrics are also used to trigger alerts if the values fall outside a particular range.
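To make the idea concrete, here is a toy in-memory stand-in for a time-series store with a range check on top. It is only a sketch: `MetricSeries` and `out_of_range` are invented names, samples are assumed to arrive in timestamp order, and real time-series databases persist data and offer far richer queries.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class MetricSeries:
    """Keeps (timestamp, value) samples for the last window_s seconds."""

    def __init__(self, name: str, window_s: float = 60.0) -> None:
        self.name = name
        self.window_s = window_s
        self._points: Deque[Tuple[float, float]] = deque()

    def record(self, value: float, timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        self._points.append((ts, value))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop samples older than the window (assumes in-order timestamps).
        while self._points and self._points[0][0] < now - self.window_s:
            self._points.popleft()

    def average(self, now: Optional[float] = None) -> Optional[float]:
        """Average over the window, or None if the window holds no samples."""
        self._evict(time.time() if now is None else now)
        if not self._points:
            return None
        return sum(value for _, value in self._points) / len(self._points)


def out_of_range(series: MetricSeries, low: float, high: float,
                 now: Optional[float] = None) -> bool:
    """True if the windowed average lies outside [low, high]."""
    avg = series.average(now)
    return avg is not None and not (low <= avg <= high)
```

Alerting on a windowed average rather than on a single sample is a deliberate choice in this sketch: one short CPU spike does not trigger an alert, while a sustained rise does.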

Log Collection

Another critical aspect of monitoring and compliance is log collection. System logs store information such as errors emitted by applications, or audit records of who logged in to a system. These logs need to be centralized, and they are often analyzed as well.
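Centralizing and analyzing logs can be illustrated with a small sketch. The line format used here (ISO timestamp, host, level, message) is an assumption made for this example, as are the names `parse_line`, `centralize` and `count_by_level`. Real log collectors handle many formats and ship the entries over the network to a central store.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Assumed line format: "<ISO timestamp> <host> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


class LogEntry(NamedTuple):
    timestamp: datetime
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one line; return None if it does not match the assumed format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match["ts"])
    except ValueError:
        return None
    return LogEntry(timestamp, match["host"], match["level"], match["msg"])


def centralize(*sources: Iterable[str]) -> List[LogEntry]:
    """Merge lines from several hosts into one list ordered by timestamp."""
    entries = [
        entry
        for source in sources
        for line in source
        if (entry := parse_line(line)) is not None
    ]
    # Assumes all timestamps are either timezone-aware or all naive.
    return sorted(entries, key=lambda entry: entry.timestamp)


def count_by_level(entries: Iterable[LogEntry]) -> Counter:
    """Simple analysis step: how many entries exist per log level."""
    return Counter(entry.level for entry in entries)
```

Once the entries from all hosts sit in one ordered list, questions such as "what happened on the database just before the web server started failing" become much easier to answer.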

Alerting

All of the above systems can result in alerts to technical personnel. When a system fails, or an early warning is issued, an alerting system notifies the staff member on duty via phone call, SMS, or push notification that action needs to be taken. More advanced systems also track whether the staff member responds in time and escalate the issue if they do not.
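The escalation logic can be sketched as a function that decides who should be notified at a given moment. The names `Alert` and `current_responder`, the on-call chain, and the fixed acknowledgement timeout are assumptions for illustration; commercial alerting services offer more elaborate schedules and policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    summary: str
    raised_at: float                      # seconds since some epoch
    acknowledged_at: Optional[float] = None


def current_responder(alert: Alert, on_call_chain: List[str],
                      ack_timeout_s: float, now: float) -> Optional[str]:
    """Return who should be notified at `now`, or None once acknowledged.

    Each time ack_timeout_s passes without acknowledgement, the alert moves
    one step up the chain; the last person in the chain stays responsible.
    """
    if not on_call_chain:
        raise ValueError("on_call_chain must not be empty")
    if alert.acknowledged_at is not None and alert.acknowledged_at <= now:
        return None
    elapsed = max(0.0, now - alert.raised_at)
    level = int(elapsed // ack_timeout_s)
    return on_call_chain[min(level, len(on_call_chain) - 1)]
```

With a chain of primary, secondary and manager and a five-minute timeout, an unacknowledged alert reaches the secondary contact after five minutes and the manager after ten.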

Dashboards

A useful tool for staff members is a collection of dashboards that display the metrics and alerts. Well-implemented dashboards make it much easier for staff to investigate and work on issues.

Monitoring Solutions

We list several solutions here as examples. There are many more solutions and services, and we make no claims about the completeness or correctness of this data.

Name Type Periodic Checks Metrics Collection Log Collection Alerting
Icinga self hosted X X X
Nagios self hosted X X
Zabbix self hosted X X X
Collectd self hosted X X
Munin self hosted X X
Prometheus self hosted X X X
UptimeRobot cloud X X
VictorOps cloud X
PagerDuty cloud X
DataDog cloud X X X X
New Relic cloud X X X X
ELK (Elasticsearch, Logstash, Kibana) self hosted / cloud X X X