From: Adam Faris
Reply-To: user@hadoop.apache.org
Subject: Re: Which metrics to track?
Date: Tue, 11 Sep 2012 02:31:01 +0000

From an operations perspective, hadoop metrics are a bit different than watching hosts behind a load balancer, as one needs to start thinking in terms of distributed systems and not individual hosts. The reason is that the hadoop platform is fairly resilient against multiple node failures: if we lose an entire rack of data nodes due to a switch failure, the platform might be degraded but isn't down. It's still advisable to collect cpu, network, ram and disk usage on individual hosts using something like collectd or ganglia, but focus on the display aspect. The ability to aggregate individual metrics into a global graph and understand what's going on with the platform while particular jobs are running is very helpful. Regarding JMX, you may have to look at hadoop source for explanations, but there's not a lot of cruft in JMX output.

Hadoop is good about detecting error conditions on data nodes, and that should be leveraged in monitoring and metric solutions. As with most monitoring & metric systems, start slow and ramp up as you find new conditions. Here's something to get you started with a 1.x grid.

Namenode:
'PercentUsed' or 'PercentRemaining'. You should poll either value and start to worry about block corruption when HDFS usage hits 80% used.
:)
'CorruptBlocks', 'UnderReplicatedBlocks', 'MissingBlocks', & 'FSState' will alert you to HDFS issues.
'LiveNodes' or 'DeadNodes': let hadoop monitor for the datanode process, as it's a built-in freebie.
'FilesTotal': the rule of thumb of 1GB of ram for every million files on HDFS lets you track and tune your heap size.

Jobtracker:
'BlacklistedNodesInfoJson', 'GraylistedNodesInfoJson': let hadoop monitor for broken task trackers.
"JVM heap counters": useful for heap tuning.
"Queue counts": When are jobs running? How many mappers/reducers are used at a particular time? You'll find the number of mappers/reducers active at one time more helpful than how many jobs are running.

Datanode:
"heartBeats_avg_time": if a data node has a high heartbeat time, it could be network congestion or high load on the local box.
"VolumeInfo": shows local filesystem sizes. (You could also use ganglia/collectd for this.) Note that unless you define separate partitions for spill space and HDFS blocks, a task that spills to disk could fill your local datanode filesystem.

Tasktracker:
The tasktracker has stats that could be worth looking at, like the shuffle counters to know when a job spills to disk. Currently we aren't using any of these values, so I don't have recommendations.

-- Adam

On Sep 10, 2012, at 1:16 PM, Jones, Robert wrote:

> Hello all, I am a sysadmin and do not know that much about Hadoop. I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis. I know I can access several hadoop metrics via jmx by going to http://localhost:50070/jmx?qry=hadoop:* and I've got a script that parses all that data so that I can stuff any of it that I want into our stats system.
>
> So, with all that preamble, I have one question. Which metrics are worth tracking?
> There are a *lot* of metrics returned via the jmx query but I doubt all of them are critically important. Which metrics are important to track if we want to watch for any trends or spikes in our hadoop cluster?
>
> Please provide some education to this noob. Thanks!
>
> --
> Bob Jones
> Linux Systems/Network Engineer
> ME Cloud Computing
> Advanced Systems Development, Inc.
> 1 (434) 964-3156
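
For anyone wiring the namenode suggestions from the reply into a stats system, here is a minimal Python sketch of the 80% usage check, the block counters, and the 1GB-per-million-files heap rule of thumb. It is an illustration under assumptions, not a tested monitoring script: it assumes the /jmx servlet returns JSON shaped like {"beans": [...]}, and it looks the attributes up by name in whichever bean carries them, since the exact bean names can differ between versions. The helper names (flatten_beans, check_namenode, heap_estimate_gb, fetch) are made up for this example; the URL is the one from the original question.

```python
import json
from urllib.request import urlopen

# URL taken from the original question; adjust host/port for your namenode.
JMX_URL = "http://localhost:50070/jmx?qry=hadoop:*"

# Counters from the reply that should normally be zero on a healthy HDFS.
BLOCK_COUNTERS = ("CorruptBlocks", "UnderReplicatedBlocks", "MissingBlocks")


def flatten_beans(payload):
    """Merge the attributes of every bean into one dict (later beans win).

    Assumption: the payload looks like {"beans": [{...}, {...}]}.
    """
    merged = {}
    for bean in payload.get("beans", []):
        merged.update(bean)
    return merged


def check_namenode(attrs, used_threshold=80.0):
    """Return a list of warning strings for the namenode metrics in the reply."""
    warnings = []
    used = attrs.get("PercentUsed")
    if used is not None and float(used) >= used_threshold:
        warnings.append("HDFS is %.1f%% used" % float(used))
    for name in BLOCK_COUNTERS:
        count = int(attrs.get(name, 0))
        if count > 0:
            warnings.append("%s = %d" % (name, count))
    return warnings


def heap_estimate_gb(attrs):
    """Rule of thumb from the reply: about 1 GB of heap per million files."""
    return int(attrs.get("FilesTotal", 0)) / 1_000_000


def fetch(url=JMX_URL):
    """Fetch and decode the JMX JSON document from the namenode."""
    with urlopen(url) as response:
        return json.load(response)
```

Typical use would be attrs = flatten_beans(fetch()), then pushing check_namenode(attrs) and heap_estimate_gb(attrs) into whatever graphing or alerting system you already run. Merging all beans into one dict keeps the sketch short, but attributes that share a name across beans overwrite each other, so a real script may want to select beans explicitly.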