hadoop-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: How do map tasks get assigned efficiently?
Date Wed, 24 Oct 2012 11:51:07 GMT

Data locality only works when you actually have data on the cluster itself. Otherwise, how
can the data be local?

Assuming 3x replication, no custom split, and a splittable input file...

The input will be split along block boundaries. So if your input file has 5 blocks, you will
have 5 mappers.
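
You can see the split count from the client side. A minimal sketch, assuming the
new-API TextInputFormat against a splittable HDFS file (the class name SplitCount
is just for illustration):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // For a splittable file, FileInputFormat produces roughly one
    // split per HDFS block, so a 5-block file gives 5 splits.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of splits = " + splits.size());
  }
}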

Since there are 3 copies of each block, it's possible for each map task to run on a DN
which has a copy of its block.
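
Each split carries the hostnames of the DNs holding a replica, which is what the
scheduler uses when it tries for locality. Continuing the sketch above:

// getLocations() returns the DataNode hostnames holding a replica
// of the split's block; the JobTracker prefers to run the map task
// on (or near) one of these hosts.
for (InputSplit split : splits) {
  System.out.println(java.util.Arrays.toString(split.getLocations()));
}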

So it's pretty straightforward, up to a point. 

When your cluster starts to get a lot of jobs and a slot opens up, your map task may not be data local.
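
That trade-off (hold out for a data-local slot vs. take the first open slot) is what
delay scheduling in the Fair Scheduler tunes. A sketch for Hadoop 1.x, in the
JobTracker's mapred-site.xml (check the property names against your version's fair
scheduler docs):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- How long (ms) to wait for a data-local slot before
     accepting a non-local one. -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <value>5000</value>
</property>

The Fair Scheduler also lets several jobs run map tasks concurrently, which is
relevant to your HTTP-vs-HDFS question below.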

With HBase... YMMV 
With S3 the data isn't local, so it doesn't matter which node runs the task. 



On Oct 24, 2012, at 1:10 AM, David Parks <davidparks21@yahoo.com> wrote:

> Even after reading O’Reilly’s book on Hadoop, I don’t feel like I have a clear vision
> of how the map tasks get assigned.
> They depend on splits, right?
> But I have 3 jobs running, and splits will come from various sources: HDFS, S3, and
> slow HTTP sources.
> So I’ve got some concern as to how the map tasks will be distributed to handle the
> data acquisition.
> Can I do anything to ensure that I don’t let the cluster go idle processing slow HTTP
> downloads, when the boxes could simultaneously be doing HTTP downloads for one job and
> reading large files off HDFS for another job?
> I’m imagining a scenario where the only map tasks that are running are all blocking
> on splits requiring HTTP downloads, and the splits coming from HDFS are all queuing up
> behind them, when they’d run more efficiently in parallel per node.
