hadoop-user mailing list archives

From Adam Faris <afa...@linkedin.com>
Subject Re: Which metrics to track?
Date Tue, 11 Sep 2012 02:31:01 GMT
From an operations perspective, hadoop metrics are a bit different than watching hosts
behind a load balancer, as one needs to start thinking in terms of distributed systems and
not individual hosts.  The reason is that the hadoop platform is fairly resilient against
multiple node failures: if we lose an entire rack of data nodes due to a switch failure, the
platform might be degraded but isn't down.  It's still advisable to collect cpu, network,
ram and disk usage on individual hosts using something like collectd or ganglia, but focus
on the display aspect.  The ability to aggregate individual metrics into a global graph and
understand what's going on with the platform while particular jobs are running is very helpful.
Regarding JMX, you may have to look at the hadoop source for explanations, but there's not a lot
of cruft in the JMX output.
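
To make the "global graph" idea concrete, here is a minimal Python sketch of rolling
per-host samples up into one cluster-wide series.  The (timestamp, host, value) tuple
layout and the function name are assumptions for illustration only; collectd and ganglia
each have their own storage formats and built-in aggregation.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples):
    """Roll per-host samples up into cluster-wide values per timestamp.

    `samples` is an iterable of (timestamp, host, value) tuples.  This
    layout is hypothetical; adapt it to whatever your collector emits.
    """
    by_ts = defaultdict(list)
    for ts, _host, value in samples:
        by_ts[ts].append(value)
    # One summary row per timestamp: total, average, worst host, host count.
    return {
        ts: {
            "sum": sum(values),
            "avg": mean(values),
            "max": max(values),
            "hosts": len(values),
        }
        for ts, values in sorted(by_ts.items())
    }
```

Graphing "sum" or "avg" next to "hosts" makes it easy to see a rack dropping out without
staring at dozens of per-host graphs.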

Hadoop is good about detecting error conditions on data nodes, and that should be leveraged
in monitoring and metric solutions.  As with most monitoring & metric systems, start slow and
ramp up as you find new conditions.  Here's something to get you started with a 1.x grid.

Namenode:
'PercentUsed' or 'PercentRemaining'. You should poll either value and start to worry about
block corruption when HDFS usage hits 80% used. :)  
'CorruptBlocks', 'UnderReplicatedBlocks', 'MissingBlocks', & 'FSState' will alert you
to HDFS issues.
'LiveNodes' or 'DeadNodes': let hadoop monitor for the datanode process as it's a built in
freebie.
'FilesTotal': the rule of thumb of 1GB of ram for every million files on HDFS lets you
track and tune your heap size.
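
As a rough illustration of the namenode checks above, here is a Python sketch that pulls
the JMX servlet output and flags the conditions listed.  It assumes the servlet returns
JSON shaped like {"beans": [...]} and that 'DeadNodes' is a JSON-encoded string keyed by
hostname; verify both against your own grid.  The function names and the choice to search
every bean for an attribute (rather than hard-coding bean names) are mine, not part of hadoop.

```python
import json
from urllib.request import urlopen


def fetch_beans(url):
    """Fetch the JMX servlet output and return its list of beans."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["beans"]


def find_attr(beans, attr):
    """Return the first value of `attr` found in any bean, else None."""
    for bean in beans:
        if attr in bean:
            return bean[attr]
    return None


def namenode_alerts(beans, pct_limit=80.0):
    """Build a list of human-readable alerts from namenode beans."""
    alerts = []
    pct = find_attr(beans, "PercentUsed")
    if pct is not None and float(pct) >= pct_limit:
        alerts.append("PercentUsed is %.1f" % float(pct))
    for name in ("CorruptBlocks", "UnderReplicatedBlocks", "MissingBlocks"):
        value = find_attr(beans, name)
        if value is not None and int(value) > 0:
            alerts.append("%s is %d" % (name, int(value)))
    dead = find_attr(beans, "DeadNodes")
    # Assumption: DeadNodes is a JSON string mapping hostname -> details.
    if isinstance(dead, str):
        dead = json.loads(dead or "{}")
    if dead:
        alerts.append("DeadNodes: " + ", ".join(sorted(dead)))
    return alerts


def heap_gb_estimate(files_total):
    """Rule of thumb from this thread: 1GB of heap per million files."""
    return files_total / 1_000_000
```

Usage would be something like namenode_alerts(fetch_beans("http://localhost:50070/jmx?qry=hadoop:*")),
with the result handed to whatever alerting you already run.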

Jobtracker:
'BlacklistedNodesInfoJson' 'GraylistedNodesInfoJson':   Let hadoop monitor for broken task
trackers.
"JVM heap counters": Useful for heap tuning.
"Queue counts": When are jobs running? How many mappers/reducers are used at a particular
time?  You'll find the number of mappers/reducers active at one time more helpful than how
many jobs are running.
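
For the blacklist/graylist attributes, a small Python sketch of turning them into host
lists you can alert on.  It assumes each attribute is a JSON-encoded string that decodes
to a list of objects carrying a "hostname" key; that layout is an assumption on my part,
so check a real jobtracker's output before relying on it.

```python
import json


def listed_hosts(bean, attr):
    """Decode a *NodesInfoJson attribute into a sorted list of hostnames."""
    raw = bean.get(attr)
    if not raw:
        return []
    entries = json.loads(raw) if isinstance(raw, str) else raw
    hosts = []
    for entry in entries:
        # Assumption: entries are dicts with a "hostname" key.
        if isinstance(entry, dict):
            hosts.append(str(entry.get("hostname", "unknown")))
        else:
            hosts.append(str(entry))
    return sorted(hosts)


def tracker_report(bean):
    """Summarize the task trackers hadoop has already flagged as broken."""
    return {
        "blacklisted": listed_hosts(bean, "BlacklistedNodesInfoJson"),
        "graylisted": listed_hosts(bean, "GraylistedNodesInfoJson"),
    }
```

Tracking the length of each list over time shows whether task trackers are being
blacklisted faster than they are being repaired.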

Datanode:
"heartBeats_avg_time": if a data node has a high average heartbeat time, it could be network
congestion or high load on the local box.
"VolumeInfo": shows local filesystem sizes. (You could also use ganglia/collectd for this).
 Note that unless you define separate partitions for spill space and HDFS blocks, when the
task spills to disk you could fill your local datanode filesystem.
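
Since "high" only means something relative to the rest of the grid, here is a Python
sketch that flags data nodes whose heartBeats_avg_time stands out from the cluster median.
The factor of 3 and the 5.0 floor are arbitrary starting points, not recommended values,
and the input dict of hostname to heartBeats_avg_time is a hypothetical shape for whatever
your collection script produces.

```python
from statistics import median


def slow_heartbeats(avg_times, factor=3.0, floor=5.0):
    """Return hosts whose heartBeats_avg_time is well above the median.

    `avg_times` maps hostname -> heartBeats_avg_time.  `factor` and
    `floor` are illustrative thresholds; tune them for your grid.
    """
    if not avg_times:
        return []
    # The floor keeps a very quiet cluster from flagging tiny differences.
    limit = max(factor * median(avg_times.values()), floor)
    return sorted(host for host, t in avg_times.items() if t > limit)
```

Comparing against the median rather than a fixed number keeps the check meaningful when
the whole cluster slows down under a heavy job.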
 
Tasktracker:
The tasktracker has stats that could be worth looking at, like the shuffle counters to know
when a job spills to disk.  Currently we aren't using any of these values so I don't have
recommendations. 


-- Adam


On Sep 10, 2012, at 1:16 PM, Jones, Robert wrote:

> Hello all, I am a sysadmin and do not know that much about Hadoop.  I run a stats/metrics
> tracking system that logs stats over time so you can look at historical and current data and
> perform some trend analysis.  I know I can access several hadoop metrics via jmx by going
> to http://localhost:50070/jmx?qry=hadoop:* and I've got a script that parses all that data
> so that I can stuff any of it that I want into our stats system.
> 
> So, with all that preamble, I have one question.  Which metrics are worth tracking?
> There are a *lot* of metrics returned via the jmx query but I doubt all of them are critically
> important.  Which metrics are important to track if we want to watch for any trends or spikes
> in our hadoop cluster?
> 
> Please provide some education to this noob.  Thanks!
> 
> --
> Bob Jones
> Linux Systems/Network Engineer
> ME Cloud Computing
> Advanced Systems Development, Inc.
> 1 (434) 964-3156

