From: Adam Faris <afaris@linkedin.com>
To: "<user@hadoop.apache.org>" <user@hadoop.apache.org>
Subject: Re: Which metrics to track?
Date: Tue, 11 Sep 2012 02:31:01 +0000

From an operations perspective, Hadoop metrics are a bit different from watching hosts behind a load balancer: you need to start thinking in terms of a distributed system and not individual hosts. The reason is that the Hadoop platform is fairly resilient against multiple node failures. If we lose an entire rack of data nodes due to a switch failure, the platform might be degraded, but it isn't down. It's still advisable to collect CPU, network, RAM, and disk usage on individual hosts using something like collectd or ganglia, but focus on the display aspect. The ability to aggregate individual metrics into a global graph and understand what's going on with the platform while particular jobs are running is very helpful. Regarding JMX, you may have to look at the Hadoop source for explanations, but there's not a lot of cruft in the JMX output.
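
To make the aggregation idea concrete, here is a minimal Python sketch that rolls per-host samples up into one cluster-wide data point per metric. The host names, metric names, and the `aggregate_cluster` helper are all hypothetical; collectd and ganglia have their own aggregation features, so treat this only as an illustration of the shape of the data.

```python
from collections import defaultdict
from statistics import mean


def aggregate_cluster(samples):
    """Roll per-host samples up into one cluster-wide summary per metric.

    `samples` maps hostname -> {metric_name: value}. A host that did not
    report a given metric is simply left out of that metric's summary.
    """
    by_metric = defaultdict(list)
    for host_metrics in samples.values():
        for name, value in host_metrics.items():
            by_metric[name].append(value)
    return {
        name: {
            "hosts": len(values),
            "sum": sum(values),
            "avg": mean(values),
            "max": max(values),
        }
        for name, values in by_metric.items()
    }


# Hypothetical samples, e.g. as exported from collectd or ganglia.
samples = {
    "dn01": {"cpu_pct": 80.0, "net_mbps": 400.0},
    "dn02": {"cpu_pct": 40.0, "net_mbps": 200.0},
    "dn03": {"cpu_pct": 60.0},  # network sample missing for this host
}
cluster = aggregate_cluster(samples)
print(cluster["cpu_pct"])
print(cluster["net_mbps"])
```

Plotting the cluster-wide average or sum next to the per-host maximum makes it easier to tell a busy grid from a single misbehaving node.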

Hadoop is good about detecting error conditions on data nodes, and that detection should be leveraged in monitoring and metric solutions. As with most monitoring and metric systems, start slow and ramp up as you find new conditions. Here's something to get you started with a 1.x grid.

Namenode:
'PercentUsed' or 'PercentRemaining': Poll either value and start to worry about block corruption when HDFS usage hits 80% used. :)
'CorruptBlocks', 'UnderReplicatedBlocks', 'MissingBlocks', and 'FSState' will alert you to HDFS issues.
'LiveNodes' or 'DeadNodes': Let Hadoop monitor for the datanode process, as it's a built-in freebie.
'FilesTotal': Using the rule of thumb of 1 GB of RAM for every million files on HDFS, this lets you track and tune your heap size.
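
As a rough sketch of how the namenode values above could feed an alerting script, the Python below scans the beans returned by the /jmx servlet for those attributes. It assumes the servlet returns a JSON object with a top-level "beans" list; the bean names in the sample data are illustrative only, "Operational" as the healthy 'FSState' value is an assumption to verify on your grid, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_beans(url):
    """Fetch /jmx output, e.g. http://localhost:50070/jmx, and return its beans."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["beans"]


def find_attr(beans, attr):
    """Return the first value of `attr` found in any bean, else None."""
    for bean in beans:
        if attr in bean:
            return bean[attr]
    return None


def namenode_alerts(beans, max_percent_used=80.0):
    """Build a list of human-readable alerts from namenode beans."""
    alerts = []
    used = find_attr(beans, "PercentUsed")
    if used is not None and float(used) >= max_percent_used:
        alerts.append("HDFS is %.1f%% used" % float(used))
    for attr in ("CorruptBlocks", "UnderReplicatedBlocks", "MissingBlocks"):
        count = find_attr(beans, attr)
        if count:
            alerts.append("%s = %d" % (attr, int(count)))
    state = find_attr(beans, "FSState")
    # Assumption: a healthy namenode reports "Operational".
    if state is not None and state != "Operational":
        alerts.append("FSState = %s" % state)
    return alerts


def heap_gb_for_files(files_total, gb_per_million=1.0):
    """Rule of thumb from above: about 1 GB of heap per million files."""
    return files_total / 1_000_000 * gb_per_million


# Hypothetical sample; in practice this would come from fetch_beans(...).
beans = [
    {"name": "example:NameNodeInfo", "PercentUsed": 83.5},
    {"name": "example:FSNamesystem", "FSState": "Operational",
     "FilesTotal": 42000000, "CorruptBlocks": 0,
     "UnderReplicatedBlocks": 12, "MissingBlocks": 0},
]
alerts = namenode_alerts(beans)
print(alerts)
print(heap_gb_for_files(find_attr(beans, "FilesTotal")))
```

Looking attributes up by name across all beans keeps the sketch independent of exact bean names, which you should confirm against your own /jmx output.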

Jobtracker:
'BlacklistedNodesInfoJson', 'GraylistedNodesInfoJson': Let Hadoop monitor for broken task trackers.
"JVM heap counters": Useful for heap tuning.
"Queue counts": When are jobs running? How many mappers/reducers are used at a particular time? You'll find the number of mappers/reducers active at one time more helpful than how many jobs are running.
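
One way to turn the blacklist and graylist attributes into a number you can graph is sketched below in Python. It assumes, based on the "Json" suffix, that each attribute is a string containing a JSON-encoded collection of nodes; the sample bean contents and the `count_nodes` helper are hypothetical, so check the real structure in your own JMX output.

```python
import json


def count_nodes(bean, attr):
    """Count entries in a JSON-encoded node attribute; 0 if absent or empty."""
    raw = bean.get(attr)
    if not raw:
        return 0
    # len() works whether the decoded value is a list or a dict of nodes.
    return len(json.loads(raw))


# Hypothetical jobtracker bean; field contents are made up for illustration.
jobtracker_bean = {
    "BlacklistedNodesInfoJson": json.dumps([{"hostname": "tt07"}]),
    "GraylistedNodesInfoJson": "[]",
}
blacklisted = count_nodes(jobtracker_bean, "BlacklistedNodesInfoJson")
graylisted = count_nodes(jobtracker_bean, "GraylistedNodesInfoJson")
print(blacklisted, graylisted)
```

The same helper would also work for the namenode's 'LiveNodes' and 'DeadNodes' if they turn out to be JSON-encoded strings on your version.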

Datanode:
"heartBeats_avg_time": If a data node has a high heartbeat time, it could be network congestion or high load on the local box.
"VolumeInfo": Shows local filesystem sizes. (You could also use ganglia/collectd for this.) Note that unless you define separate partitions for spill space and HDFS blocks, a task that spills to disk could fill your local datanode filesystem.
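
For the "VolumeInfo" value, a small Python sketch like the following could compute a percent-used figure per volume. It assumes the attribute is a JSON-encoded string mapping each volume path to an object with "usedSpace" and "freeSpace" in bytes; those field names and the sample paths are assumptions to verify against your own datanode output.

```python
import json


def volume_usage(volume_info_json):
    """Return {path: percent_used} from a JSON-encoded VolumeInfo string."""
    usage = {}
    for path, info in json.loads(volume_info_json).items():
        used = info["usedSpace"]
        total = used + info["freeSpace"]
        # Guard against an empty or unreadable volume reporting zero space.
        usage[path] = 100.0 * used / total if total else 0.0
    return usage


# Hypothetical VolumeInfo payload with made-up paths and sizes.
sample = json.dumps({
    "/data/1": {"usedSpace": 750, "freeSpace": 250},
    "/data/2": {"usedSpace": 100, "freeSpace": 900},
})
usage = volume_usage(sample)
print(usage)
```

If "usedSpace" only counts HDFS blocks, spill files sharing the same partition will not show up in this figure, which is one more reason to also watch the filesystem itself with ganglia or collectd.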

Tasktracker:
The tasktracker has stats that could be worth looking at, like the shuffle counters to know when a job spills to disk. Currently we aren't using any of these values, so I don't have recommendations.


-- Adam


On Sep 10, 2012, at 1:16 PM, Jones, Robert wrote:

> Hello all, I am a sysadmin and do not know that much about Hadoop. I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis. I know I can access several hadoop metrics via jmx by going to http://localhost:50070/jmx?qry=hadoop:* and I've got a script that parses all that data so that I can stuff any of it that I want into our stats system.
>
> So, with all that preamble, I have one question. Which metrics are worth tracking? There are a *lot* of metrics returned via the jmx query but I doubt all of them are critically important. Which metrics are important to track if we want to watch for any trends or spikes in our hadoop cluster?
>
> Please provide some education to this noob. Thanks!
>
> --
> Bob Jones
> Linux Systems/Network Engineer
> ME Cloud Computing
> Advanced Systems Development, Inc.
> 1 (434) 964-3156