wot.io Monitoring Systems and Services
Monitoring is a vital tool when developing, optimizing and understanding the health of your application services and infrastructure. wot.io has several data monitoring services in our data service exchange™, and we deploy and use a few of these as part of our own monitoring system. In this post we outline how we use these monitoring services, with a tour of our virtual machines, message bus and third-party data services in our data service exchange.
Our monitoring setup can be broken down into 3 basic parts:
- automated deployment
- historical metric collection
- host checks and alerting
We use Docker and wot.io's configuration service for automated service deployment. Each newly deployed virtual machine (VM) automatically spins up with a default set of monitoring client containers:
- collectd: host metric collection
- dockerstats: container metric collection
- sensu-client: runs host, container and application checks
- rsyslog-forwarder: forwards logs to a remote server (not covered here)
We use a Graphite server fronted by Tessera dashboards to collect and view our historical metrics. By default, we collect metrics for each host (collectd) and all of its running containers (dockerstats). Both containers send metrics to a Graphite server, which Tessera queries to populate its dashboards.
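To make the metric flow concrete, here is a minimal sketch of how a client can hand a metric to Graphite using Carbon's plaintext protocol (one `path value timestamp` line per metric, conventionally on port 2003). This is not the collectd or dockerstats implementation; the hostname and metric path are hypothetical placeholders.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener.

    The default host is a placeholder, not a real wot.io endpoint.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Example line for a made-up metric path; nothing is sent over the network here.
line = format_metric("hosts.vm01.cpu.idle", 93.5, 1700000000)
print(line, end="")
```

Keeping the formatting separate from the socket code makes it easy to batch several lines into a single connection.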
Let's take a look at the default dashboards that are generated when we provision a new VM. This is accomplished by posting a JSON dashboard definition to a Tessera host.
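Posting a dashboard definition could look roughly like the sketch below. The endpoint path (`/api/dashboard/`), the field names in the payload and the Tessera URL are assumptions made for illustration; check them against the Tessera version you run. The request is only built here, not sent.

```python
import json
import urllib.request


def build_dashboard_request(tessera_url, hostname):
    """Build (but do not send) a POST request carrying a JSON dashboard definition."""
    # Hypothetical minimal definition; a real one would also describe the graphs.
    dashboard = {
        "title": f"{hostname} overview",
        "category": "hosts",
        "tags": ["collectd", "dockerstats"],
    }
    body = json.dumps(dashboard).encode("utf-8")
    return urllib.request.Request(
        url=f"{tessera_url.rstrip('/')}/api/dashboard/",  # assumed endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_dashboard_request("http://tessera.example.com", "vm01")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Generating the definition from the hostname is what lets a provisioning step create the same set of dashboards for every new VM.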
Default tessera dashboards
Tessera and collectd in action
Checks and alerts
The final piece of our monitoring system is Sensu. Sensu is written in Ruby, backed by RabbitMQ, and uses Nagios-style checks to alert us when bad things happen or, in some cases, when bad things are about to happen. By default, sensu-client gives us a basic keepalive. We have added our own checks to notify us when other, more specific problems arise:
- container checks: verifies that all the containers that are configured to run on that host are indeed running
- host checks: lets us know if we are running over 90% usage on cpu, memory or disk
- application checks: sensu-client will run all checks placed in the /checks directory of any container running on that host
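A container check of the kind described above boils down to comparing the containers a host is configured to run against the ones that are actually running. The sketch below is an illustration under assumptions, not the check we ship: the function names are made up, and the running list is read with `docker ps --format '{{.Names}}'`.

```python
import subprocess


def running_containers():
    """Return the names of running containers as reported by the docker CLI."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def check_containers(expected, running):
    """Return a Nagios-style (exit_code, message) pair for a container check."""
    missing = sorted(set(expected) - set(running))
    if missing:
        # Exit code 2 is the critical level
        return 2, "CRITICAL: not running: " + ", ".join(missing)
    return 0, f"OK: all {len(set(expected))} containers running"


# Example with hard-coded lists, so it runs without a Docker daemon
code, message = check_containers(
    ["collectd", "dockerstats", "sensu-client"],
    ["collectd", "sensu-client"],
)
print(code, message)
```

In a real check script, `running` would come from `running_containers()` and the process would exit with the returned code.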
We use the standard 4 Nagios levels:
- ok: exit code 0
- warning: exit code 1
- critical: exit code 2
- unknown: exit code 3
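As an illustration of how these exit codes are used, here is a minimal sketch of a disk usage check built around the 90% threshold mentioned earlier. Mapping the threshold to the warning level, and the function names, are assumptions; a real check may pick different levels.

```python
import shutil

# The four Nagios-style levels, expressed as process exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(percent_used, threshold=90.0):
    """Map a usage percentage to an (exit_code, message) pair."""
    if percent_used is None:
        return UNKNOWN, "UNKNOWN: usage could not be read"
    if percent_used > threshold:
        return WARNING, f"WARNING: usage {percent_used:.1f}% exceeds {threshold:.0f}%"
    return OK, f"OK: usage {percent_used:.1f}%"


def disk_percent_used(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def main():
    """Run the check, print the status line and return the exit code."""
    try:
        percent = disk_percent_used("/")
    except OSError:
        percent = None
    code, message = classify_usage(percent)
    print(message)
    return code


# As a standalone check script, finish with: import sys; sys.exit(main())
```

The monitoring agent reads the exit status to decide the alert level and uses the printed line as the alert text.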
OK, warning and unknown alerts are sent as emails and Slack posts. We reserve critical alerts for big things, like containers not running or a host that has stopped sending keepalives. Critical alerts go straight to PagerDuty and our on-call team.
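The routing rule above can be summarized in a few lines. This is only a sketch of the decision, not a Sensu handler configuration; the channel names are labels chosen for illustration, and it assumes critical alerts go only to PagerDuty.

```python
def route_alert(exit_code):
    """Return the notification channels for a Nagios-style exit code."""
    if exit_code == 2:
        # Critical: page the on-call team
        return ["pagerduty"]
    if exit_code in (0, 1, 3):
        # OK, warning and unknown: email and Slack
        return ["email", "slack"]
    raise ValueError(f"unexpected exit code: {exit_code}")


for status in (0, 1, 2, 3):
    print(status, route_alert(status))
```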
Example sensu container check
Example sensu application check
As described above, we use these tools to monitor and collect data on our own systems, and we also make them available to customers who have additional needs for these data services. The integration into our deployment system automatically launches the appropriate agents, which is essential when we deploy a large number of services at once, as we did for the LiveWorx Hackathon.