Execution framework

The execution framework should consist of the following components:
  • Config/Plugin loader - reads the configuration, loads the appropriate plugins, and initializes them with the data from the config file.
  • Main Loop - creates an environment in which each test runs independently. Monitors the relationships between the tests and skips the unnecessary ones.
  • Data collection - collects the results from the tests and stores them in the DB (when available).
The main loop calls each plugin once per minute. Each plugin should work asynchronously and, when ready, pass its result to the system and release all its resources. If a plugin's previous test has not finished, we don't start a new one.
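
The dispatch rule above (one call per plugin per tick, no new run while the previous one is still in flight) could be sketched roughly as follows. This is only an illustration built on Python's asyncio; the class name MainLoop, its attributes, and the idea of modelling a plugin as an async callable are assumptions, not part of the design.

```python
import asyncio


class MainLoop:
    """Dispatches each plugin once per tick, skipping plugins still running."""

    def __init__(self, plugins):
        # plugins: dict mapping a plugin name to an async callable (hypothetical shape)
        self.plugins = plugins
        self.running = {}   # plugin name -> asyncio.Task still in flight
        self.results = []   # (plugin name, result) pairs handed back by plugins

    def _store(self, name, task):
        # Called when a plugin finishes: free its slot and keep the result.
        self.running.pop(name, None)
        if not task.cancelled() and task.exception() is None:
            self.results.append((name, task.result()))

    def tick(self):
        """Start every idle plugin; return the names that were skipped."""
        skipped = []
        for name, run in self.plugins.items():
            if name in self.running:
                skipped.append(name)   # previous test not finished: no new run
                continue
            task = asyncio.create_task(run())
            task.add_done_callback(lambda t, n=name: self._store(n, t))
            self.running[name] = task
        return skipped

    async def run_forever(self, interval=60.0):
        # One tick per minute by default; must run inside an event loop.
        while True:
            self.tick()
            await asyncio.sleep(interval)
```

In a real deployment `asyncio.run(MainLoop(plugins).run_forever())` would drive the loop; `tick()` is kept separate so that it can be exercised without waiting a full minute.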

Challenges:
  1. system overload: we must make sure that the system is capable of monitoring 500 services even when half of them are down.
  2. resources: we must make sure the monitoring server won't run out of sockets, memory, etc.
  3. timing: all the tests should finish within 1 minute so that the next round can start.
  4. transparency for the monitored services: monitoring must not disrupt the production services it watches.
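
One possible way to address the overload, resource, and timing challenges together is to cap the number of tests in flight and give every test its own timeout. The sketch below assumes asyncio-based tests; the function name run_bounded and the default values (50 concurrent tests, 5 s timeout) are illustrative assumptions, not requirements.

```python
import asyncio


async def run_bounded(tests, max_concurrent=50, timeout=5.0):
    """Run async test callables with a concurrency cap and a per-test timeout.

    tests: dict mapping a test name to an async callable (hypothetical shape).
    Returns a dict mapping each name to its result, or to -1 when the test
    timed out or raised.
    """
    gate = asyncio.Semaphore(max_concurrent)   # caps sockets/memory in use at once
    results = {}

    async def guarded(name, test):
        async with gate:
            try:
                results[name] = await asyncio.wait_for(test(), timeout)
            except Exception:
                # Timeouts and test errors are both recorded as a failure.
                results[name] = -1

    await asyncio.gather(*(guarded(n, t) for n, t in tests.items()))
    return results
```

With these assumed values, 250 failing services that all run into the timeout cost about 250 × 5 / 50 = 25 s, which still fits into the one-minute budget.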

 

Database

The database should keep the timestamped results of each test against every machine, according to the configuration. When a service fails, none of its dependent services will have a value in the DB. The result number represents the service availability/performance. For services that have only 2 states: Available=1, Not Available=-1. For services whose availability can be measured (like ICMP round-trip time, response time, download time, etc.), we store the number of ms the operation took, or -1 if the operation did not succeed within the defined timeout. Additional information for each test should be stored in a note field.
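
As an illustration of this encoding, the sketch below turns raw test outcomes into rows and leaves out services whose parent failed. The function names, the row layout (timestamp, service, value, note), and the single-parent dependency map are assumptions made for the example only.

```python
import time


def encode_result(ok, elapsed_ms=None):
    """Map an outcome to the stored number: 1/-1 for two-state, ms or -1 for timed."""
    if not ok:
        return -1
    return 1 if elapsed_ms is None else int(round(elapsed_ms))


def build_rows(outcomes, depends_on, now=None):
    """Build (timestamp, service, value, note) rows, omitting dependants of failures.

    outcomes:   service -> (ok, elapsed_ms or None, note)
    depends_on: service -> the parent service it cannot work without
    """
    now = time.time() if now is None else now
    rows = []
    for service, (ok, elapsed_ms, note) in outcomes.items():
        parent, seen, suppressed = depends_on.get(service), set(), False
        # Walk up the dependency chain; a failed ancestor suppresses the row.
        while parent is not None and parent not in seen:
            seen.add(parent)   # guards against accidental cycles in the config
            if parent in outcomes and not outcomes[parent][0]:
                suppressed = True
                break
            parent = depends_on.get(parent)
        if not suppressed:
            rows.append((now, service, encode_result(ok, elapsed_ms), note))
    return rows
```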


Note: when the SQL server is unavailable, the system should store the SQL updates in a file and flush them all once the server becomes available again.
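
A minimal sketch of such a spool file is shown below. It assumes the DB layer is wrapped in a callable that raises ConnectionError while the server is away; the class name SqlSpool, that callable, and the one-statement-per-line file layout are all hypothetical.

```python
import os


class SqlSpool:
    """Appends SQL statements to a file while the DB is down and replays them later."""

    def __init__(self, execute, path):
        self.execute = execute   # callable(sql); raises ConnectionError when the DB is away
        self.path = path

    def write(self, sql):
        """Drain the backlog, then send sql; spool it on failure. True if sent."""
        try:
            self.flush()
            self.execute(sql)
            return True
        except ConnectionError:
            # One statement per line; assumes statements hold no significant newlines.
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(sql.replace("\n", " ") + "\n")
            return False

    def flush(self):
        """Replay spooled statements in order; keep whatever could not be sent."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as f:
            pending = [line.rstrip("\n") for line in f if line.strip()]
        sent = 0
        try:
            for sql in pending:
                self.execute(sql)
                sent += 1
        finally:
            remaining = pending[sent:]
            if remaining:
                with open(self.path, "w", encoding="utf-8") as f:
                    f.write("\n".join(remaining) + "\n")
            else:
                os.remove(self.path)
        return sent
```

Draining the backlog before every new statement keeps the updates in their original order once the server is back.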

Note2: To avoid overloading the DB, we should provide a utility for migrating/backing up a full day/week/month of monitoring data to a different server.

 

Remote Probes

To monitor local machine resources, we need a small, non-intelligent module installed on every machine.
These modules will be responsible for gathering, every minute, information such as: CPU utilization (plus top 5 processes), memory usage (plus top 10 processes), HDD utilization, HDD free space, LAN utilization, fatal errors from log files, etc.

 

 

Communication between the probes and the monitoring server will be initiated by the server. If the server is down, the probe should store the data locally (for up to 1-2 h) and hand it over when the server asks for it. We can use a custom format, an XML-based protocol, or SQL insert commands to exchange data, and the transport can be TCP or Spread (http://www.spread.org/).

The probe plugin will send a request to each probe once per minute and store the result in the DB.
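
The local buffering on the probe side could look roughly like the sketch below: samples are kept until the server polls, and anything older than the retention limit is dropped. The class name ProbeBuffer and the 2 h default are assumptions; the wire format and transport are left out on purpose.

```python
from collections import deque


class ProbeBuffer:
    """Keeps samples until the server collects them, dropping those older than max_age_s."""

    def __init__(self, max_age_s=2 * 3600):
        self.max_age_s = max_age_s
        self.samples = deque()   # (timestamp, sample dict), oldest first

    def add(self, timestamp, sample):
        """Store one minute's sample and expire anything past the retention limit."""
        self.samples.append((timestamp, sample))
        self._expire(timestamp)

    def _expire(self, now):
        while self.samples and now - self.samples[0][0] > self.max_age_s:
            self.samples.popleft()

    def drain(self, now):
        """Called when the server polls: return all fresh samples and clear the buffer."""
        self._expire(now)
        out = list(self.samples)
        self.samples.clear()
        return out
```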

 

Error Logging

The activity log should provide brief information about the execution of the process. Example:

[DateTimeStamp] [ProcessingTime] icmp: 41/0, port: 23/0, http: 12/0, https: 3/0, ftp: 1/0, mysql: 1/0, dcp: 12/0, real: 1/0, mail: 2/0, probe: 6/0, imagen: 2/0

We will have one line per minute in the log file.

The value 41/0 means that we ran 41 icmp tests and 0 of the targets were not available.

We should support an optional DEBUG log level, where each plugin gives detailed information about each test. We should be able to turn the DEBUG level on/off for a specific plugin only.

The error log should inform us of all fatal problems (like can't connect to DB, no available sockets, missing remote probe, etc.). Unavailability of a monitored service won't be considered a fatal error.
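
A small sketch of how such a line could be produced and read back is given below. The exact rendering of DateTimeStamp and ProcessingTime is not defined here, so the timestamp string and the "ms" suffix used in the example are assumptions, as are the function names.

```python
import re

_PAIR = re.compile(r"(\w+): (\d+)/(\d+)")


def format_activity_line(stamp, processing_ms, counts):
    """counts: plugin name -> (tests run, tests failed), in the order to print."""
    body = ", ".join(f"{name}: {total}/{failed}"
                     for name, (total, failed) in counts.items())
    return f"[{stamp}] [{processing_ms} ms] {body}"


def parse_activity_line(line):
    """Return plugin name -> (tests run, tests failed) from one activity line."""
    body = line.rsplit("] ", 1)[-1]   # drop the two bracketed prefixes
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in _PAIR.finditer(body)}
```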
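
If the framework were written in Python, the per-plugin DEBUG switch could rely on the logger hierarchy of the standard logging module, as in the sketch below. The logger names ("monitor", "monitor.plugin.<name>") and the function names are assumptions.

```python
import logging


def configure_logging():
    """Set the default level for the whole system; plugins inherit it."""
    root = logging.getLogger("monitor")
    root.setLevel(logging.INFO)
    return root


def set_plugin_debug(name, enabled):
    """Turn DEBUG on or off for one plugin without touching the others."""
    # NOTSET makes the plugin logger fall back to its parent's level again.
    level = logging.DEBUG if enabled else logging.NOTSET
    logging.getLogger(f"monitor.plugin.{name}").setLevel(level)
```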

 

Escalation System

The escalation system should be fully independent of the data collection system, and the only interface between the two should be the DB. The system should have fully configurable rules for reporting problems based on events in the DB: the number of times a certain event occurred in a given time frame, combinations of events, etc.
The system should send an email/SMS/pager message/SNMP trap or post a defect in the DTS (again by email). Sending mail should work even if the primary mail server is down.
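
One of the rule types mentioned above, "N events within a time frame", could be sketched as a sliding window over the results read from the DB. The class name CountInWindowRule and its parameters are hypothetical, and only failures (value -1) are counted in this example.

```python
from collections import deque


class CountInWindowRule:
    """Fires when at least `threshold` failures occurred within the last `window_s` seconds."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self.failures = deque()   # timestamps of failed results, oldest first

    def observe(self, timestamp, value):
        """Feed one DB result (value -1 means failed); return True if the rule fires."""
        if value == -1:
            self.failures.append(timestamp)
        # Forget failures that have slid out of the time frame.
        while self.failures and timestamp - self.failures[0] > self.window_s:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```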

Once a problem has been reported, the system should be configurable (per issue) for the following behavior:

  1. continue to send messages every XX (e.g. 30) minutes until the problem is fixed.
  2. send a message when the problem is fixed.
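
The two behaviors above can be expressed as a small per-issue state machine. The sketch below is only one way to do it; the class name IssueNotifier, the event names, and the 30-minute default are assumptions.

```python
class IssueNotifier:
    """Decides what to send for one issue: first report, reminders, and recovery."""

    def __init__(self, repeat_s=30 * 60, notify_on_fix=True):
        self.repeat_s = repeat_s           # None disables the reminders
        self.notify_on_fix = notify_on_fix
        self.failing = False
        self.last_sent = None

    def update(self, now, is_failing):
        """Return 'problem', 'reminder', 'fixed', or None for this observation."""
        if is_failing:
            if not self.failing:
                self.failing, self.last_sent = True, now
                return "problem"
            if self.repeat_s is not None and now - self.last_sent >= self.repeat_s:
                self.last_sent = now
                return "reminder"
            return None
        if self.failing:
            self.failing = False
            return "fixed" if self.notify_on_fix else None
        return None
```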

 

The escalation system should provide an interface for scheduling maintenance on certain hosts/servers/datacenters, in which case error reporting should be temporarily disabled for that period of time.
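
Checking an alert against the scheduled maintenance could be as simple as the sketch below. It assumes each window is stored as a (target, start, end) tuple and that every alert knows the names it belongs to (its host and its datacenter); both shapes, like the function name, are illustrative.

```python
def is_suppressed(windows, targets, now):
    """Return True if a maintenance window covers one of the alert's targets at `now`.

    windows: iterable of (target, start, end), end exclusive
    targets: names the alert belongs to, e.g. {"web1", "dc-east"}
    """
    return any(target in targets and start <= now < end
               for target, start, end in windows)
```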

 

Visualization System

The visualization system should be fully independent of the data collection system, and the only interface between the two should be the DB. The system should support per-user preferences (XML-based), where the user can define a set of panels and, in each panel, a list of tests. Each user should be able to define the layout on the screen (how many panels, how wide, how many columns, etc.). For each test the system should check the 3 thresholds (normal, warning, and fatal) and display the value in green, yellow, or red accordingly. The threshold definitions should allow expressions that use other values, like: HIT_SEC * 2. Some results can also be hidden based on a condition. The screen should be refreshed once per minute.
The panel configuration should be exportable as XML (so that users can share their panels with one another).
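
Threshold expressions such as HIT_SEC * 2 need some evaluator. The sketch below is one restricted option that accepts only numbers, names, and + - * /, built on Python's ast module. It assumes that larger values are worse and that a stored -1 (failed) is always shown in red; the function names are hypothetical.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def eval_threshold(expr, values):
    """Evaluate a threshold like 'HIT_SEC * 2' against a dict of current values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return values[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


def color(value, warning, fatal, values):
    """Return 'green', 'yellow' or 'red' for a result where larger is worse."""
    if value < 0 or value >= eval_threshold(fatal, values):
        return "red"
    if value >= eval_threshold(warning, values):
        return "yellow"
    return "green"
```

Walking the parsed expression instead of calling eval() keeps user-supplied preferences from running arbitrary code.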

The DBs from all datacenters should be replicated to the site where the visualization and escalation systems run.

A full graph of every datacenter should be created automatically from the test results. The graph should include the list of machines, the list of IPs, and the list of services running on each machine (based on monitoring); a mouse-over on a service should show a list of virtual hosts or other configuration information. The test results for every service should be shown (including the color coding), so that when a machine has an issue, it turns red. The diagram should look similar to what WhatsUp Gold from Ipswitch produces.

Reporting System

The reporting system should be fully independent of the data collection system, and the only interface between the two should be the DB. The system should support per-user preferences (XML-based), where each user can define the reports they want to receive every hour/day/month. A report can include the summary from one or more plugins, a list of all incidents, or a list of abnormal conditions.
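
A per-test summary for such a report could be derived directly from the stored values, for example as in the sketch below. The function name and the choice of figures (sample count, failure count, availability percentage) are assumptions; the only convention taken from the DB section is that -1 marks a failed test.

```python
def summarize(values):
    """Summarize the stored results of one test over a report period (-1 means failed)."""
    total = len(values)
    failures = sum(1 for v in values if v == -1)
    # Availability is undefined when there are no samples in the period.
    availability = 100.0 * (total - failures) / total if total else None
    return {"samples": total, "failures": failures, "availability_pct": availability}
```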