hadoop-common-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: distcp questions
Date Mon, 16 Aug 2010 06:43:38 GMT

On Aug 15, 2010, at 10:34 AM, Kris Jirapinyo wrote:
> 1) Our new cluster has 25 machines but 100 mappers.  When distcp is triggered, it seems
> to allocate 4 mappers per machine.  Is this normal? The issue here is that, say distcp only
> needs 8 mappers, I would think that distcp would try to distribute those to different machines
> so that perhaps IO will not be saturated on one machine.  What I've been seeing is that for
> those 8 map tasks, 4 are assigned to one machine and 4 to the other, as opposed to each of the
> 8 being assigned to a different machine altogether.

I don't think distcp (or any other job, for that matter) can provide hints to the scheduler
about how its tasks should be distributed, other than by pointing to its input files.  So very
likely, distcp's input files are on the nodes where the tasks are being run.
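
To illustrate the locality argument, here is a toy sketch in Python. It is not Hadoop's scheduler code; the function name assign_tasks, the node names, and the greedy "prefer a node that holds the input" rule are all assumptions made up for this example. It only shows how tasks can pile up on a few machines when their inputs live there.

```python
from collections import Counter


def assign_tasks(split_hosts, nodes, slots_per_node):
    """Toy placement: put each task on a node that stores its input, if a slot is free.

    split_hosts    -- one list of host names per input split (hypothetical data)
    nodes          -- all node names in the cluster
    slots_per_node -- map slots available on every node
    """
    free = {node: slots_per_node for node in nodes}
    placement = []
    for hosts in split_hosts:
        # Data-local candidates: hosts of this split that still have a free slot.
        local = [h for h in hosts if free.get(h, 0) > 0]
        if local:
            # Among local candidates, pick the one with the most free slots.
            chosen = max(local, key=lambda h: free[h])
        else:
            # No local slot left: fall back to the first node with any free slot.
            remaining = [n for n in nodes if free[n] > 0]
            if not remaining:
                raise RuntimeError("no free map slots left in the cluster")
            chosen = remaining[0]
        free[chosen] -= 1
        placement.append(chosen)
    return placement


# 25 machines with 4 map slots each, as in the question.
nodes = [f"node{i:02d}" for i in range(1, 26)]

# Case 1: all 8 inputs happen to live on the same two machines.
clustered = assign_tasks([["node01", "node02"]] * 8, nodes, 4)

# Case 2: each of the 8 inputs lives on a different machine.
spread = assign_tasks([[f"node{i:02d}"] for i in range(1, 9)], nodes, 4)

print(Counter(clustered))
print(Counter(spread))
```

In this toy model the first case puts four tasks on each of two machines, while the second case uses eight different machines. The only thing that changed is where the inputs are stored.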

You can always try bumping up the replication as part of distcp's parameters.
