hadoop-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject Assignment of data splits to mappers
Date Thu, 13 Jun 2013 17:57:33 GMT
When MR assigns data splits to map tasks, can it assign a set of non-contiguous blocks to
a single map task?  The reason I ask is that, thinking through the problem, if I were the MR
scheduler I would try to hand a map task a group of blocks that all have a replica on the same
datanode, and then schedule the map task on that node.  E.g. if I have an HDFS file with 10000
blocks and I want to create 1000 map tasks, I'd like each map task to get 10 blocks, but those
blocks are unlikely to be contiguous within the file on any given datanode.
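
To make the idea concrete, here is a rough sketch of the grouping I have in mind. It is not how
the MR scheduler actually works; it is only a toy greedy assignment. The block-to-datanode map,
the function name group_blocks_by_node, and the node names are all made up for illustration.

```python
from collections import defaultdict


def group_blocks_by_node(block_locations, blocks_per_task):
    """Toy illustration only, not Hadoop's algorithm.

    block_locations: dict mapping block index -> list of datanodes holding a replica.
    Returns a list of (datanode, [block indexes]) pairs, one pair per map task.
    """
    per_node = defaultdict(list)
    for block_id in sorted(block_locations):
        hosts = block_locations[block_id]
        # Pick the replica host that currently has the fewest blocks assigned;
        # ties are broken by host name so the result is deterministic.
        target = min(hosts, key=lambda h: (len(per_node[h]), h))
        per_node[target].append(block_id)

    tasks = []
    for node in sorted(per_node):
        blocks = per_node[node]
        # Cut each node's block list into chunks of at most blocks_per_task.
        for i in range(0, len(blocks), blocks_per_task):
            tasks.append((node, blocks[i:i + blocks_per_task]))
    return tasks


# Hypothetical replica placement for a 4-block file on 3 datanodes.
example_locations = {
    0: ["dn1", "dn2"],
    1: ["dn2", "dn3"],
    2: ["dn1", "dn3"],
    3: ["dn1", "dn2"],
}
example_tasks = group_blocks_by_node(example_locations, blocks_per_task=2)
```

With this made-up placement, the task scheduled on dn1 gets blocks 0 and 3, which are not
adjacent in the file. That is the non-contiguous case I am asking about.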

This is related to a question I asked earlier: whether there is any benefit to aligning data
splits along block boundaries, so that a read of one block does not spill over into the next
block and require another datanode connection.  The answer I got was that the extra connection
overhead wasn't important.  The reason I bring this up again is that comments in this discussion
(https://issues.apache.org/jira/browse/HADOOP-3315) imply that doing an extra seek to the
beginning of the file to read a magic number on open is a significant overhead, and this looks
like a similar issue to me.
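
As a small worked example of what I mean by alignment, the sketch below counts how many blocks
a split's byte range overlaps. The 64 MiB block size and the function name blocks_touched are
assumptions for illustration, and the sketch ignores record boundaries, which can also push a
reader past the end of its split.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 64 * MIB  # assumed block size, for illustration only


def blocks_touched(split_start, split_length, block_size=BLOCK_SIZE):
    """Return how many blocks the byte range [start, start + length) overlaps."""
    if split_length <= 0:
        return 0
    first_block = split_start // block_size
    last_block = (split_start + split_length - 1) // block_size
    return last_block - first_block + 1


# A split that starts on a block boundary and is exactly one block long.
aligned = blocks_touched(0, BLOCK_SIZE)

# A split of the same length that starts 50 MiB into a block.
misaligned = blocks_touched(50 * MIB, BLOCK_SIZE)
```

Here the aligned split stays within one block, while the misaligned split of the same length
overlaps two blocks, so reading it may mean talking to a second datanode.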
