Monitoring

Monitoring is the practice of checking whether an application is working correctly. Of course, it is desirable to know when the application is fully down, but it is also worth checking whether the application is performing as desired. This can range from collecting system performance metrics to collecting and analyzing business metrics. The border between system monitoring and business analytics is therefore fluid, and business metrics often feed into alerting for system administrators.

Periodic Checks

When thinking of monitoring, the first monitoring system most system administrators implement is uptime monitoring. This type of monitoring checks whether, for example, the front page of the application can be loaded or a database system can be accessed.

This kind of monitoring helps pinpoint which system is having problems, but it is often not enough to fully monitor a system: a single up-or-down check gives little basis for an early warning system. This is where metrics collection comes in handy.
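
As a rough illustration, a single uptime check against a web front page could look like the sketch below. It uses only the Python standard library; the names `CheckResult` and `check_http` are made up for this example, and a real setup would run the check on a schedule and pass the result on to an alerting system.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    url: str
    up: bool
    detail: str
    seconds: float  # how long the check took


def check_http(url: str, timeout: float = 5.0) -> CheckResult:
    """Run one uptime check: 'up' means the URL answered with a success status."""
    started = time.monotonic()
    try:
        # urlopen follows redirects and raises HTTPError for 4xx/5xx answers.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            up, detail = True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server is reachable but reports an error.
        up, detail = False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar problems.
        up, detail = False, f"unreachable: {exc}"
    return CheckResult(url, up, detail, time.monotonic() - started)
```

Calling `check_http("https://example.com/")` once a minute and recording `up` and `seconds` would already give a basic uptime history; the URL here is only a placeholder.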

Functionality Monitoring

An extended version of uptime monitoring is functionality monitoring. This is also a periodic check, but instead of checking a single system component, it simulates a real user interaction using automation tools like Selenium. One example is to simulate a user signup and check that it is still possible. This kind of monitoring is helpful but can often lead to false alarms and is expensive to maintain.
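
The structure of such a check can be sketched without a browser. In the hypothetical example below, a scenario is a list of named steps, each a plain Python callable; in practice each step would drive a browser through a tool such as Selenium. The names `run_scenario` and `ScenarioResult` are assumptions for this illustration. Retrying the whole scenario once before reporting a failure is one simple way to reduce false alarms.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step is a name plus an action that raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


@dataclass
class ScenarioResult:
    ok: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None
    attempts: int = 0


def _run_once(steps: List[Step]) -> ScenarioResult:
    """Run the steps in order and stop at the first one that raises."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return ScenarioResult(ok=False, failed_step=name, error=str(exc))
    return ScenarioResult(ok=True)


def run_scenario(steps: List[Step], attempts: int = 2) -> ScenarioResult:
    """Run the scenario, repeating it from the start up to `attempts` times."""
    result = ScenarioResult(ok=False)
    for attempt in range(1, attempts + 1):
        result = _run_once(steps)
        result.attempts = attempt
        if result.ok:
            break
    return result
```

A signup check would then be a list such as `[("open signup page", ...), ("submit form", ...), ("see welcome page", ...)]`, and `failed_step` tells the person on duty where the simulated user got stuck.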

Metrics Collection

When it comes to early warning and monitoring over time, metrics systems are beneficial. These systems collect various metrics, from CPU usage to customer signups in the last minute, and store them in a time-series database. The metrics can then be aggregated, overlaid on past data, used to predict trends, et cetera. Often these metrics are also used to trigger alerts if the values fall outside a particular range.
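
To make the idea concrete, the following minimal sketch keeps samples in memory, averages them over a time window, and flags a window whose average leaves an allowed range. `MetricStore` and `out_of_range` are hypothetical names; a production system would use a real time-series database in place of this toy store.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple


class MetricStore:
    """Tiny in-memory stand-in for a time-series database."""

    def __init__(self) -> None:
        # metric name -> list of (timestamp, value) samples
        self._series: Dict[str, List[Tuple[float, float]]] = defaultdict(list)

    def record(self, name: str, timestamp: float, value: float) -> None:
        self._series[name].append((timestamp, value))

    def window_mean(self, name: str, start: float, end: float) -> Optional[float]:
        """Mean of all samples with start <= timestamp < end, or None if there are none."""
        values = [v for t, v in self._series.get(name, []) if start <= t < end]
        return mean(values) if values else None


def out_of_range(store: MetricStore, name: str, start: float, end: float,
                 low: float, high: float) -> bool:
    """True if the window average exists and lies outside [low, high]."""
    avg = store.window_mean(name, start, end)
    return avg is not None and not (low <= avg <= high)
```

Comparing `window_mean` for the current hour with the same hour a week earlier is the simplest form of overlaying current values on past data.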

Log Collection

Another critical aspect of monitoring and compliance is log collection. Logs store information such as errors emitted by applications or audit records of who logged in to a system. These logs need to be centralized and are often analyzed as well.
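
Centralized analysis tends to be easier when applications emit logs in a structured form. As one possible approach, the sketch below uses Python's standard `logging` module to write each record as a single JSON line that a log shipper could forward to a central store. `JsonFormatter` and `build_logger` are names invented for this example, and the field names are an assumption, not a standard.

```python
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "host": socket.gethostname(),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback so errors can be analyzed centrally.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

An audit entry is then an ordinary call such as `logger.info("user %s logged in", username)`, and the host field lets the central system tell machines apart.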

Alerting

All the above systems can result in alerts to technical personnel. For example, when a system fails or an early warning is issued, an alerting system notifies the staff member on duty via phone call, SMS, or push notification that action needs to be taken. More advanced systems also track whether the staff member responds in time and escalate the issue if they do not.
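
One simple way to model escalation is a chain of contacts combined with an acknowledgement timeout. The sketch below is an assumption about how such a policy might be expressed, not a description of any particular product: the first contact is notified immediately, and each further contact is added after another timeout passes without acknowledgement. `Alert` and `contacts_to_notify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    summary: str
    raised_at: float                        # seconds since some reference point
    acknowledged_at: Optional[float] = None


def contacts_to_notify(alert: Alert, chain: List[str],
                       ack_timeout: float, now: float) -> List[str]:
    """Return the contacts who should have been notified by time `now`."""
    if alert.acknowledged_at is not None and alert.acknowledged_at <= now:
        return []  # somebody responded, so nothing more to do
    elapsed = max(0.0, now - alert.raised_at)
    # Level 1 is notified at once, level 2 after one timeout, and so on.
    levels = min(len(chain), int(elapsed // ack_timeout) + 1)
    return chain[:levels]
```

With the chain `["on-call", "team-lead", "manager"]` and a timeout of 300 seconds, an unacknowledged alert reaches the team lead after five minutes and the manager after ten.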

Dashboards

A beneficial tool for staff members is a collection of dashboards to view the metrics and alerts. Well-implemented dashboards make it easier for staff to investigate and work on issues.

Monitoring Solutions

We list several solutions here as examples. There are many more solutions and services, and we make no claims about the completeness or correctness of this data.

Name	Type	Periodic Checks	Metrics Collection	Log Collection	Alerting
Icinga	self hosted	X	X		X
Nagios	self hosted	X			X
Zabbix	self hosted	X	X		X
Collectd	self hosted		X		X
Munin	self hosted		X		X
Prometheus	self hosted	X	X		X
UptimeRobot	cloud	X			X
VictorOps	cloud				X
PagerDuty	cloud				X
DataDog	cloud	X	X	X	X
New Relic	cloud	X	X	X	X
ELK (Elasticsearch, Logstash, Kibana)	self hosted / cloud		X	X	X