From: Apache Wiki
To: Apache Wiki
Date: Mon, 16 Nov 2009 11:18:00 -0000
Message-ID: <20091116111800.174.42342@eos.apache.org>
Subject: [Hadoop Wiki] Update of "ManagementTools" by SteveLoughran

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ManagementTools" page has been changed by SteveLoughran.
The comment on this change is: New page on cluster management tools.
http://wiki.apache.org/hadoop/ManagementTools

--------------------------------------------------

New page:
= Hadoop Cluster Management Tools =

On a big cluster you don't want to have your phone page you every time a node goes down. The only individual machines you care about are the NameNode, the Secondary NameNode and the JobTracker. Worker nodes come and go. What matters there is the total cluster availability, the availability of the live data, and whether the rate of node failure is too high to get useful work done.

The other thing to be aware of is that the troublesome workers are not the dead ones; those are easy to detect because they don't report for duty. The troublesome ones are the nodes where the disk is playing up so badly that the system is really slow, so their work takes too long. Or their RAM isn't working properly, so only 1GB of it appears, and every job fails with memory problems. Or some strange motherboard/CPU/OS combination causes a machine to hit race conditions in code where none surface elsewhere. That's what you need to identify: the troublemakers. Once found, you can set up Hadoop to blacklist those nodes.

== Nagios ==
There is support for Nagios in Hadoop.

== Ganglia ==
There is support for Ganglia in Hadoop.

== JMX Support ==
Hadoop has JMX support, so with the right JMX bridge for your chosen management tools, it should be possible to keep an eye on Hadoop from your favorite management console.

=== JMX Bridging to Zenoss ===
Allen at LinkedIn says "We're working on getting our stats into Zenoss via the JMX connector and SNMP because Ganglia seems to have some fundamental issues (like grouping of hosts is a *client* side config). Note that Zenoss is available in both open source and commercial forms.
We're using the commercial version, but the open source version would probably be just as good. But that aside: We're taking the approach of grid health by watching and monitoring the dead/live node count by scraping the NN and JT web pages. We also do daily fsck's, lsr's, and run a cut-down version of gridmix. While monitoring individual nodes is useful in a proactive sense, the bigger your grid gets, the less important it becomes."
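
The two ideas above, alerting on overall cluster health rather than on single worker failures and hunting for slow "troublemaker" nodes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 10% dead-node threshold, and the "three times the median" cutoff are all hypothetical. The live/dead counts and per-node task times are assumed to have been collected already (for example from the NN and JT web pages), since their format is not described here.

```python
from statistics import median


def cluster_alert(live, dead, max_dead_fraction=0.1):
    """Return True when the fraction of dead workers exceeds the threshold.

    The 0.1 default is an arbitrary illustrative value, not a Hadoop setting.
    """
    total = live + dead
    if total == 0:
        # No workers reporting at all is itself worth an alert.
        return True
    return dead / total > max_dead_fraction


def slow_nodes(task_seconds, factor=3.0):
    """Return the nodes whose mean task time exceeds factor x the median.

    task_seconds maps a node name to a list of task durations in seconds.
    Nodes with no recorded tasks are ignored.
    """
    means = {node: sum(times) / len(times)
             for node, times in task_seconds.items() if times}
    if not means:
        return []
    cutoff = factor * median(means.values())
    return sorted(node for node, mean in means.items() if mean > cutoff)


if __name__ == "__main__":
    # Hypothetical sample data: 5 dead workers out of 100, one slow node.
    print(cluster_alert(live=95, dead=5))
    print(slow_nodes({"node-a": [10, 12], "node-b": [11], "node-c": [90, 110]}))
```

Comparing each node against the cluster median, rather than against a fixed time limit, keeps the check meaningful as job sizes change. Nodes flagged this way are candidates for the blacklisting mentioned earlier, not proof of a hardware fault.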