Today (Apr 7 2004) at ClusterExpo (http://www.clusterworldexpo.com) I attended the Ganglia (http://ganglia.sourceforge.net) presentation by Matt Massie (http://www.cs.berkeley.edu/~massie/). I spoke with him for 20 min after the presentation and we exchanged cards.

Ganglia is a monitoring system (developed at Berkeley) used in many grids/clusters, including those based on Rocks Cluster (http://www.rocksclusters.org/), a very popular cluster management system.
Basically, the system collects CPU, Mem, HDD and LAN data from every machine in the cluster and sends this data to all the other machines in the cluster (over multicast). Of course, multicast is not routable most of the time. This way every machine in the cluster knows the status of all the others. Of course this makes the traffic to each node enormous, but they do some optimization: a node skips sending a packet if the value is close to the last value sent and the last send was recent enough (a time + value threshold on send). Of course this does not help if you have a very dynamic/jumpy environment.
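To make the time + value threshold idea concrete, here is a minimal sketch of how I understand it. This is not Ganglia's actual code: the class name, the parameter names (value_threshold, max_silence_sec) and the exact comparison rules are my assumptions.

```python
import time


class MetricSender:
    """Decides whether a new sample is worth announcing to the cluster.

    Hypothetical sketch: names and rules are assumptions, not Ganglia's API.
    """

    def __init__(self, value_threshold, max_silence_sec):
        self.value_threshold = value_threshold  # minimum change worth sending
        self.max_silence_sec = max_silence_sec  # longest allowed gap between sends
        self.last_value = None
        self.last_sent_at = None

    def should_send(self, value, now=None):
        now = time.time() if now is None else now
        if self.last_value is None:
            return True  # nothing sent yet, always announce the first sample
        changed = abs(value - self.last_value) > self.value_threshold
        stale = (now - self.last_sent_at) >= self.max_silence_sec
        return changed or stale

    def record_send(self, value, now=None):
        self.last_value = value
        self.last_sent_at = time.time() if now is None else now


def count_sends(samples, value_threshold, max_silence_sec):
    """Count how many (timestamp, value) samples would actually be sent."""
    sender = MetricSender(value_threshold, max_silence_sec)
    sent = 0
    for now, value in samples:
        if sender.should_send(value, now):
            sender.record_send(value, now)
            sent += 1
    return sent


# One sample per second for a minute.
steady = [(t, 50.0) for t in range(60)]
jumpy = [(t, 0.0 if t % 2 == 0 else 100.0) for t in range(60)]
```

With a value threshold of 5 and a 20 sec silence limit, the steady series is sent only when the silence limit expires, while the jumpy series is sent on every sample, which is exactly the case where the optimization stops helping.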
The management node pulls data from any one of the grid nodes (as they all have the same information). To solve the scalability problem (why did they have to create it in the first place?) they created an aggregation module that sends a summary of all the nodes in a small grid to the upper level, where it can be displayed together with other small grids.
But if you want to look deeper, you get a link to the web server of the machine that is collecting the data for that specific grid.
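The aggregation step can be pictured roughly like this. It is only a sketch under my own assumptions: the data layout (host name -> metric name -> value), the function names and the choice of summing the metrics are hypothetical, not taken from Ganglia.

```python
def summarize_cluster(name, nodes):
    """Collapse per-node metrics into one summary record for the upper level.

    nodes: dict of host name -> dict of metric name -> numeric value
    (hypothetical layout, chosen only for illustration).
    """
    totals = {}
    for metrics in nodes.values():
        for metric, value in metrics.items():
            totals[metric] = totals.get(metric, 0) + value
    return {"cluster": name, "hosts": len(nodes), "totals": totals}


def grid_overview(summaries):
    """Upper level: keeps only the summaries, keyed by cluster name."""
    return {s["cluster"]: s for s in summaries}


segment_a = {
    "n1": {"mem_free_mb": 100, "cpu_num": 2},
    "n2": {"mem_free_mb": 300, "cpu_num": 2},
}
segment_b = {
    "n1": {"mem_free_mb": 50, "cpu_num": 4},
}

overview = grid_overview([
    summarize_cluster("A", segment_a),
    summarize_cluster("B", segment_b),
])
```

Note that the per-node values never reach the upper level in this sketch, only the totals do, which is why you have to follow the link to the lower-level machine for details.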
The fastest update pace they recommend is 15 sec (and only for some of the counters). As they send a multicast datagram for every single change, this loads the network a lot if you are doing a 1 sec pace with 500+ nodes.
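A back-of-the-envelope calculation shows the scale. The 30 metrics per node below is purely my assumption for illustration (no number was given in the talk), and the model is the worst case where every metric changes on every interval.

```python
def datagrams_per_second(nodes, metrics_per_node, interval_sec):
    """Worst case: every metric on every node changes every interval.

    Because the datagrams are multicast, this is also the rate
    that arrives at each single node in the cluster.
    """
    return nodes * metrics_per_node / interval_sec


METRICS_PER_NODE = 30  # assumed value, not from the presentation

at_1_sec = datagrams_per_second(500, METRICS_PER_NODE, 1)    # 15000.0
at_15_sec = datagrams_per_second(500, METRICS_PER_NODE, 15)  # 1000.0
```

Under these assumptions the 1 sec pace costs every node 15 times more incoming datagrams than the recommended 15 sec pace.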
There is no way to see the values coming from different segments together (other than opening two browsers). For example, there is no way to set an alarm (there are no alarms for now, but there will be soon) if node #23 in segment A has > 10 GB free mem and node #444 in segment B has < 1 MB free.
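Just to illustrate what such a cross-segment alarm would need, here is a hypothetical sketch. Nothing like this exists in Ganglia today; the rule format, the node names and the data layout (segment -> node -> metric) are all invented for the example.

```python
import operator

GB = 1024 ** 3
MB = 1024 ** 2


def alarm_fires(segments, conditions):
    """True only when every condition holds, no matter which segment it is in.

    conditions: list of (segment, node, metric, compare, limit) tuples
    (hypothetical rule format).
    """
    return all(
        compare(segments[segment][node][metric], limit)
        for segment, node, metric, compare, limit in conditions
    )


# The example from the text: node #23 in A has > 10 GB free,
# node #444 in B has < 1 MB free.
RULE = [
    ("A", "node23", "mem_free", operator.gt, 10 * GB),
    ("B", "node444", "mem_free", operator.lt, 1 * MB),
]

segments = {
    "A": {"node23": {"mem_free": 12 * GB}},
    "B": {"node444": {"mem_free": 512 * 1024}},
}

fires = alarm_fires(segments, RULE)
```

The point is that the evaluator needs the raw per-node values from both segments in one place, which is exactly what the summary-only upper level does not have.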
I find the design of the system not bad for a university project (and far from commercial quality), but this system has huge market penetration. Maybe I really have to leave you guys to write crappy software ;-)
I didn't see anything interesting at the expo... Vlad and Essy were there too. Joro may write a report on what he saw...
Lenkov