WatchCub is a service availability and performance monitoring system. It provides an extensible framework for building cost-effective performance monitoring and problem escalation systems.
The main design goals of the system are:
- real-time monitoring of availability and performance
- performance evaluation that mimics real-life situations
- fully independent monitoring, escalation, reporting and visualization systems
- collecting and correlating results from several monitors to avoid false alarms caused by a failed sensor
- flexible threshold definitions that allow temporary anomalies to be skipped, so alarms are raised only on real threats
- XML configuration supporting pattern-based test configuration with inheritance and parameter overriding
- scalable architecture, up to tens of thousands of monitoring points
- multi-threaded test execution framework, giving full independence to every plug-in/test
- asynchronous (state-machine) test design, allowing multiple concurrent tests to run with small CPU overhead
- easy development of new monitoring plug-ins using Perl or C
- dependencies between tests (for example, HTTP tests are not run when port 80 is down)
- support for attaching remote probes for tests such as CPU load or memory usage that cannot be performed remotely
- aggregation and consolidation of collected data to optimize storage
- no extra load on the monitored service and network infrastructure
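The goals of correlating several monitors and skipping temporary anomalies can be illustrated with a small sketch. It is written in Python only for brevity and is not WatchCub code: the class name `CorrelatedAlarm`, the `quorum` and `consecutive` parameters, and the probe names are all hypothetical. The assumed rule is that an alarm is raised only when at least `quorum` monitors report a failure in each of `consecutive` rounds in a row.

```python
from collections import deque


class CorrelatedAlarm:
    """Hypothetical helper: alarm only when enough monitors agree for long enough."""

    def __init__(self, quorum, consecutive):
        self.quorum = quorum            # monitors that must fail within one round
        self.consecutive = consecutive  # failing rounds in a row needed to alarm
        self.history = deque(maxlen=consecutive)  # keeps only the latest rounds

    def add_round(self, results):
        """results maps a monitor name to True (OK) or False (failed)."""
        failures = sum(1 for ok in results.values() if not ok)
        self.history.append(failures >= self.quorum)
        # Alarm only once the window is full and every round in it failed.
        return len(self.history) == self.consecutive and all(self.history)


alarm = CorrelatedAlarm(quorum=2, consecutive=3)
rounds = [
    {"probe-a": False, "probe-b": True,  "probe-c": True},   # single failed sensor
    {"probe-a": False, "probe-b": False, "probe-c": True},   # quorum reached (1 of 3)
    {"probe-a": False, "probe-b": False, "probe-c": False},  # quorum reached (2 of 3)
    {"probe-a": True,  "probe-b": False, "probe-c": False},  # quorum reached (3 of 3)
]
states = [alarm.add_round(r) for r in rounds]
print(states)  # [False, False, False, True]
```

A single failing probe never triggers the alarm here, and a failure confirmed by two probes still has to persist for three rounds before it is escalated.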
The system contains the following components:
- monitoring plug-ins and corresponding probes (responsible for executing the tests and providing the results)
- execution framework (responsible for running the plug-ins, checking the dependencies, and collecting/storing the results)
- escalation system (responsible for detecting problems/anomalies and escalating notifications)
- reporting system (generates configurable reports based on the collected data, delivered by email/web)
- visualization system (provides a live, configurable view of the collected results over the web)
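The dependency checking done by the execution framework might look roughly like the following sketch. It is a simplified, synchronous illustration in Python, not the actual multi-threaded WatchCub framework: the function `run_tests`, the test names, and the result strings `"ok"`, `"failed"` and `"skipped"` are assumptions made for this example. A test is skipped whenever one of its prerequisites did not pass.

```python
def run_tests(tests, deps):
    """Run each test once; skip it if any prerequisite did not pass.

    tests: dict mapping a test name to a zero-argument callable returning True/False
    deps:  dict mapping a test name to a list of prerequisite test names
           (assumed to be acyclic; cycles are not detected in this sketch)
    """
    results = {}

    def run(name):
        if name in results:               # already evaluated, reuse the result
            return results[name]
        for dep in deps.get(name, []):
            if run(dep) != "ok":          # prerequisite failed or was skipped
                results[name] = "skipped"
                return results[name]
        results[name] = "ok" if tests[name]() else "failed"
        return results[name]

    for name in tests:
        run(name)
    return results


# Hypothetical tests: the port check fails, so the HTTP test is never executed.
tests = {
    "port_80": lambda: False,
    "http_get": lambda: True,
    "ping": lambda: True,
}
deps = {"http_get": ["port_80"], "port_80": ["ping"]}
results = run_tests(tests, deps)
print(results)
```

Because `http_get` is skipped rather than run, it adds no load on the monitored service and produces no redundant failure report on top of the `port_80` failure.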