hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Poor data locality of MR job
Date Thu, 02 Aug 2012 18:37:03 GMT
On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <bryanck@gmail.com> wrote:
> I have an 8 node cluster and a table that is pretty well balanced with on average 36
> regions/node. When I run a mapreduce job on the cluster against this table, the data locality
> of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would
> expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname
> in the splits is the same as the hostnames in the task attempts.

Thanks for checking DNS already; it's the first thing that came to
mind when I saw the subject line.

> The table isn't new and from what I understand, HDFS replication will eventually keep
> region data blocks local to the regionserver. Are there other reasons for data locality to
> be poor and any way to fix it?

Block locality doesn't play a role here. TableInputFormat publishes
where the region is hosted, but where the data blocks belonging to
that region live is a separate matter that isn't taken into account.
In any case, even if the region was on node A and all of its data
happened to be on nodes B, C, and D, your mapper would still only talk
to A, since that's where the region is.

So only one region server serves a region, meaning there's only one
node you can send the map to in order to get data locality. I'd guess
that the maps that aren't local are launched towards the end of the
job, because you might have slower maps and/or an imperfect balance
of regions per region server.

For example, let's say one node is full of data-local maps but still
has 2-3 more to process while other nodes have free slots. The
JobTracker (JT) has a locality timeout for each map, so if one node is
just too busy it will fall back to rack-local nodes instead. In this
example those 2-3 maps might get sent elsewhere.
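
To make the fallback concrete, here is a toy model of it in Python.
This is only a sketch under simplifying assumptions, not how the
JobTracker is actually implemented: time advances in whole ticks,
every map is queued at tick 0 (so a map has waited `tick` ticks), and
the names `simulate`, `pending`, `slots`, `task_ticks` and
`locality_wait` are made up for the illustration.

```python
def simulate(pending, slots, task_ticks, locality_wait):
    """Toy model of locality fallback (not the real JobTracker logic).

    pending       -- node -> number of unstarted maps whose region is on that node
    slots         -- map slots per node
    task_ticks    -- node -> ticks one map takes on that node (>= 1)
    locality_wait -- ticks a map must have waited before it may run non-locally

    Returns (data_local, non_local) map counts.
    """
    pending = dict(pending)                    # don't mutate the caller's dict
    running = {node: [] for node in pending}   # remaining ticks per running map
    data_local = non_local = 0
    tick = 0

    while any(pending.values()):
        # Pass 1: every node fills its free slots with its own regions first.
        for node in pending:
            while len(running[node]) < slots and pending[node] > 0:
                pending[node] -= 1
                running[node].append(task_ticks[node])
                data_local += 1

        # Pass 2: once the wait has expired, idle slots take maps from the
        # node with the longest backlog; those maps are no longer data-local.
        if tick >= locality_wait:
            for node in pending:
                while len(running[node]) < slots:
                    donor = max(pending, key=pending.get)
                    if pending[donor] == 0:
                        break                  # nothing left to hand out
                    pending[donor] -= 1
                    running[node].append(task_ticks[node])
                    non_local += 1

        # Advance the clock: drop maps that finish during this tick.
        for node in running:
            running[node] = [t - 1 for t in running[node] if t > 1]
        tick += 1

    return data_local, non_local


# Node "a" hosts twice as many regions as "b" and "c".
regions = {"a": 8, "b": 4, "c": 4}
speed = {"a": 1, "b": 1, "c": 1}

print(simulate(regions, slots=2, task_ticks=speed, locality_wait=0))   # (14, 2)
print(simulate(regions, slots=2, task_ticks=speed, locality_wait=10))  # (16, 0)
```

With no wait, the last two maps for node "a" run on "b" once it goes
idle; with a wait longer than the job, everything stays data-local but
"a" needs one extra tick to drain its own backlog.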

There are ways to tune this depending on which scheduler you are
using, but it will mostly involve waiting more for each task to make
sure they can get to the right node.

At your scale this sounds to me more like over-optimization. How big
of a hit are you taking?

