hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject map side only behavior
Date Fri, 29 Jan 2010 15:40:21 GMT
Hi all,
If I run a map-only job (number of reducers set to 0), how does Hadoop behave?
Will it still merge and sort the spills generated by each mapper?


----- Original Message ----
From: Gang Luo <lgpublic@yahoo.com.cn>
To: common-user@hadoop.apache.org
Sent: 2010/1/29 (Fri) 8:54:33 AM
Subject: Re: fine granularity operation on HDFS

Yeah, I see how it works. Thanks Amogh.


----- Original Message ----
From: Amogh Vasekar <amogh@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: 2010/1/28 (Thu) 10:00:22 AM
Subject: Re: fine granularity operation on HDFS

Hi Gang,
Yes, PathFilters work only on file paths. I meant that you can include that kind of logic at the split level.
The input format's getSplits() method is responsible for computing the splits and adding them to a
list, and the JobTracker (JT) initializes mapper tasks for the splits in that list. You can override
getSplits() to add only some of the splits to the list, chosen by, say, location or offset. Here is the
reference code:
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                                   blkLocations[blkLocations.length-1].getHosts()));
        }

Before each splits.add call you can apply your own logic to discard unwanted splits. However, you
need to make sure your record reader handles incomplete records at split boundaries.
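
To make the "discard before splits.add" idea concrete, here is a minimal standalone sketch in plain Java. It does not use the Hadoop API: the Split class is a hypothetical stand-in for FileSplit, computeSplits() and the keep predicate are names made up for illustration, and block locations are left out entirely. Only the loop shape and the 1.1 slop factor mirror the reference code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class SplitFilterSketch {
    // Slop factor as in FileInputFormat: the last split may be up to 10% larger.
    static final double SPLIT_SLOP = 1.1;

    /** Hypothetical stand-in for FileSplit: a byte range within one file. */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
        @Override public String toString() { return "[start=" + start + ", len=" + length + "]"; }
    }

    /**
     * Walks the file in splitSize steps, like the getSplits() loop, but only
     * adds a candidate split when the caller-supplied predicate accepts it.
     */
    static List<Split> computeSplits(long fileLength, long splitSize, Predicate<Split> keep) {
        List<Split> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            Split candidate = new Split(fileLength - bytesRemaining, splitSize);
            if (keep.test(candidate)) {   // custom discard logic goes here
                splits.add(candidate);
            }
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // Trailing split covers whatever is left of the file.
            Split candidate = new Split(fileLength - bytesRemaining, bytesRemaining);
            if (keep.test(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Keep only the splits that start within the first 200 bytes of a 350-byte file.
        List<Split> kept = computeSplits(350, 100, s -> s.start < 200);
        System.out.println(kept);
    }
}
```

In a real job the same check would sit inside an overridden getSplits() in your own input format, and the predicate could look at the split's offset or hosts instead of just the start position.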

To load blocks separately you need their locations; the FileSystem class exposes a few methods
for that, such as getFileBlockLocations().
Hope this helps.


On 1/28/10 7:26 PM, "Gang Luo" <lgpublic@yahoo.com.cn> wrote:

Thanks Amogh.

For the second part of my question, I actually meant loading blocks separately from HDFS. I
don't know whether that is realistic. Anyway, since my goal is to process different divisions of
a file separately, doing that at the split level is OK. But even if I can get the splits from the
input format, how do I "add only a few splits you need to mapper and discard the others"?
(PathFilters only work on files, not blocks, I think.)



