hadoop-common-user mailing list archives

From Harish Mallipeddi <harish.mallipe...@gmail.com>
Subject Re: File Chunk to Map Thread Association
Date Thu, 20 Aug 2009 09:30:36 GMT
On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun <roman.wsmo@gmail.com> wrote:

>
> Hello Harish,
>
> I know that the TaskTracker creates separate threads (up to
> mapred.tasktracker.map.tasks.maximum) which execute the map() function.
> However, I haven't found the piece of code which associates a FileSplit with
> a given map thread. Is the split fetched locally by the TaskTracker or by the
> MapTask?
>
>
>
Yes, this is done by the MapTask.
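
If it helps to see where the hand-off happens: in the old mapred API, the
MapTask obtains a RecordReader for its assigned split from the job's
InputFormat. One rough way to observe which FileSplit a given map task
receives is to wrap an InputFormat and log the split it is handed (the
LoggingTextInputFormat class below is a hypothetical sketch, not Hadoop code):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical wrapper: logs the FileSplit a map task is handed when the
    // framework asks for a RecordReader, then delegates to TextInputFormat.
    public class LoggingTextInputFormat extends TextInputFormat {
      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        if (split instanceof FileSplit) {
          FileSplit fs = (FileSplit) split;
          System.err.println("map task got split: " + fs.getPath()
              + " offset=" + fs.getStart() + " length=" + fs.getLength());
        }
        return super.getRecordReader(split, job, reporter);
      }
    }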


>
> I know I can increase the input split size by changing
> 'mapred.min.split.size', however, the file is split sequentially and very
> rarely are two consecutive HDFS blocks stored on a single node. This means
> that data locality will not be exploited, because every map() will have to
> fetch part of the file over the network.
>
> Roman Kolcun
>

I see what you mean - you want to modify the Hadoop code to allocate
multiple (non-sequential) data-local blocks to one MapTask. I don't know if
you'll gain much from all that work. Hadoop already lets you reuse a
launched JVM for multiple MapTasks, which should minimize the overhead of
launching them.
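
JVM reuse is a per-job setting in the old mapred API; a minimal sketch,
assuming the pre-0.21 property name mapred.job.reuse.jvm.num.tasks:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseConfig.class);
        // -1 = reuse each launched JVM for any number of this job's tasks;
        // the default of 1 starts a fresh JVM per task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // ... set InputFormat, Mapper, paths, etc. and submit the job as usual
      }
    }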
Increasing the DFS block size for the input files is another way to achieve
the same effect.
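
Both knobs can be set per job; a minimal sketch using the pre-0.21 property
names (the 256 MB figure is only an illustrative value):

    import org.apache.hadoop.mapred.JobConf;

    public class BigSplitConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf(BigSplitConfig.class);
        // Ask FileInputFormat for splits of at least 256 MB; note that a split
        // larger than one HDFS block will usually span blocks on other nodes.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
        // Alternatively, write the input files with a larger block size so each
        // single-block split is bigger and still node-local (takes effect only
        // for files written with this setting):
        // conf.setLong("dfs.block.size", 256L * 1024 * 1024);
      }
    }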


-- 
Harish Mallipeddi
http://blog.poundbang.in
