hadoop-common-user mailing list archives

From Edmund Kohlwey <ekohl...@gmail.com>
Subject Re: return in map
Date Sun, 06 Dec 2009 15:52:40 GMT
Let me see if I understand:
The mapper is reading lines in a text file. You want to see if a single
line meets a given criterion, and emit all the lines whose index is
greater than or equal to the single matching line's index. I'll assume
that more than one line meeting the criterion is a separate condition
which you will handle appropriately.

First, some discussion of your input: is this a single file that should
be considered as a whole? In that case, you probably only want one
mapper, which, depending on your reduce task, may totally invalidate the
use case for MapReduce. You may just want to read the file directly from
HDFS and write to HDFS in whatever application is using the data.

Anyway, here's how I'd do it. In setup, open a temporary file (it can
be directly on the node, or on HDFS, although directly on the node is
preferable). Use map to perform your test, and keep a counter of how
many lines match. After the first line matches, begin saving lines to
the temp file. If a second line matches, log the error condition. In
cleanup, if exactly one line matched, open your temp file and emit the
lines you saved earlier.
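
As a rough sketch of the above, here is the setup/map/cleanup logic in
plain Java with no Hadoop dependency. The class name TailAfterMatch and
the Predicate standing in for your criterion are hypothetical. In a real
job the three method bodies would go into your Mapper's setup, map and
cleanup overrides, and the list returned here would instead be written
out through the Context.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper that mirrors the mapper lifecycle described above.
class TailAfterMatch {
    private final Predicate<String> criterion; // assumed per-line test
    private Path tempFile;
    private BufferedWriter out;
    private int matches = 0;

    TailAfterMatch(Predicate<String> criterion) {
        this.criterion = criterion;
    }

    // setup: open a temporary file on the local node.
    void setup() throws IOException {
        tempFile = Files.createTempFile("tail-after-match", ".txt");
        out = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
    }

    // map: called once per input line; counts matches and, once the
    // first match has been seen, saves every line (the match included).
    void map(String line) throws IOException {
        if (criterion.test(line)) {
            matches++;
        }
        if (matches >= 1) {
            out.write(line);
            out.newLine();
        }
    }

    // cleanup: replay the saved lines only if exactly one line matched.
    List<String> cleanup() throws IOException {
        out.close();
        List<String> emitted = new ArrayList<>();
        try {
            if (matches == 1) {
                try (BufferedReader in =
                         Files.newBufferedReader(tempFile, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        emitted.add(line);
                    }
                }
            } else if (matches > 1) {
                // The error condition: more than one line matched.
                System.err.println("expected one matching line, saw " + matches);
            }
        } finally {
            Files.deleteIfExists(tempFile);
        }
        return emitted;
    }
}
```

Nothing is emitted until cleanup because the mapper cannot know whether
a second match is coming until it has seen its whole split.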

There are a few considerations in your implementation:
1. File size. If the temporary file exceeds the available local disk
space on the mapper's node, you can make a temp file in HDFS, but this
is far from ideal.
2. As noted above, if there's a single mapper and no need to sort or
reduce the output, you probably want to just implement this as a program
that happens to be using HDFS as a data store, and not bother with
MapReduce at all.
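
For the non-MapReduce route in point 2, a minimal sketch might look like
the following. The names TailCopy and copyFromFirstMatch are made up for
illustration, and it assumes you simply start copying at the first
matching line. It works on generic readers and writers; with HDFS you
would wrap the streams returned by FileSystem.open and FileSystem.create.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.function.Predicate;

// Hypothetical single-pass filter, no MapReduce involved.
class TailCopy {
    // Copies the first line that satisfies the criterion and every line
    // after it from in to out. Returns the number of lines written.
    static int copyFromFirstMatch(BufferedReader in, Writer out,
                                  Predicate<String> criterion) throws IOException {
        boolean started = false;
        int written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (!started && criterion.test(line)) {
                started = true; // first match: start copying from here
            }
            if (started) {
                out.write(line);
                out.write('\n');
                written++;
            }
        }
        out.flush();
        return written;
    }
}
```

Since the file is read front to back in a single pass, no temporary file
is needed in this version.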

On 12/6/09 10:03 AM, Sonal Goyal wrote:
> Hi,
> Maybe you could post your code/logic for doing this. One way would be to set
> a flag once your criterion is met and emit keys based on the flag.
> Thanks and Regards,
> Sonal
> 2009/12/5 Gang Luo <lgpublic@yahoo.com.cn>
>> Hi all,
>> I've got a tricky problem. I manually input a small file to do some filtering
>> work on each line in the map function. I check whether the line satisfies the
>> constraint; if so I output it, otherwise I return without doing any other work.
>> Since the map function is called on each line, I think the logic is
>> correct. But it doesn't work like this. If there are 5 lines for a map task,
>> and only the 2nd line satisfies the constraint, then the output will be lines
>> 2, 3, 4, and 5. If the 3rd line satisfies it, then the output will be lines 3, 4, 5.
>> It seems that once a map task meets the first satisfying line, the filter
>> doesn't work for the following lines.
>> It is an interesting problem. I am checking it now. I also hope someone could
>> give me some ideas on this. Thanks.
>> -Gang
