hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject RE: Re: return in map
Date Mon, 07 Dec 2009 18:46:53 GMT
Thanks. It helps.


----- Original Message ----
From: Amogh Vasekar <amogh@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: 2009/12/7 (Mon) 12:43:07 AM
Subject: Re: Re:  return in map

If the file doesn't exist, Java will error out on its own.
For partial skips, the org.apache.hadoop.mapreduce.Mapper class provides a method run(), which
checks whether the end of the split has been reached and, if not, calls map() on your <k,v> pair.
You can override this method to check a flag as well; if that check fails, the rest of the
split is skipped.
Hope this helps.
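
The run() override described above can be sketched roughly as follows. This is a plain-Java model rather than Hadoop code, so it runs without the Hadoop jars: the class name SkippingMapper, the abort flag, the "ERROR" test, and the Iterator standing in for the split's record reader are all assumptions made for illustration. In a real job the loop would sit in an overridden Mapper.run(Context), between setup(context) and cleanup(context), and would read something like while (!abort && context.nextKeyValue()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of the Mapper.run() loop; not Hadoop code.
class SkippingMapper {
    // Hypothetical flag: map() raises it when the rest of the split should be skipped.
    private boolean abort = false;
    // Stand-in for the records written to the task's output.
    final List<String> output = new ArrayList<>();

    // Stand-in for map(): emits the line, or raises the flag on a bad record.
    void map(String line) {
        if (line.startsWith("ERROR")) { // hypothetical error condition
            abort = true;
            return;
        }
        output.add(line);
    }

    // Stand-in for run(): stops pulling records as soon as the flag is set.
    void run(Iterator<String> split) {
        while (!abort && split.hasNext()) {
            map(split.next());
        }
    }

    public static void main(String[] args) {
        SkippingMapper m = new SkippingMapper();
        m.run(Arrays.asList("a", "b", "ERROR x", "c").iterator());
        System.out.println(m.output); // prints [a, b]
    }
}
```

Because run() returns normally here, the task ends as a success rather than a failure, which is what avoids the retry that System.exit() triggers.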


On 12/7/09 6:38 AM, "Edmund Kohlwey" <ekohlwey@gmail.com> wrote:

As far as I know (someone please correct me if I'm wrong), mapreduce
doesn't provide a facility to signal to stop processing. You will simply
have to add a field to your mapper class that you set to signal an error
condition, then in map check if you've set the error condition and
return on each call if it's been set.
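
A minimal sketch of the flag-in-the-mapper idea, again as a plain-Java model and not Hadoop code. The names FlaggedMapper, failed, and calls, and the "empty line is an error" rule, are assumptions for illustration only. The point it shows is that map() is still invoked once per record, but does nothing after the flag has been set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a mapper that carries an error flag; not Hadoop code.
class FlaggedMapper {
    private boolean failed = false;                 // hypothetical error flag
    final List<String> output = new ArrayList<>();  // stand-in for emitted records
    int calls = 0;                                  // counts map() invocations

    // Stand-in for map(): returns immediately once an error has been seen.
    void map(String line) {
        calls++;
        if (failed) {
            return;
        }
        if (line.isEmpty()) { // hypothetical error condition
            failed = true;
            return;
        }
        output.add(line);
    }

    public static void main(String[] args) {
        FlaggedMapper m = new FlaggedMapper();
        for (String line : Arrays.asList("a", "", "b", "c")) {
            m.map(line);
        }
        System.out.println(m.output + " after " + m.calls + " calls"); // prints [a] after 4 calls
    }
}
```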

On 12/6/09 6:55 PM, Gang Luo wrote:
> Thanks for the response.
> It seems there was something wrong in my logic. I have more or less solved it now. What I am
> still unsure of is how to return or exit in a MapReduce program. If I want to skip one line
> (because it doesn't satisfy some constraint, for example), using return to leave the map
> function is enough. But what if I want to quit a map task (due to some error I detect, for
> example, the file I want to read doesn't exist)? If I use System.exit(), Hadoop will try to
> run the task again. Similarly, if I catch an exception and want to quit the current task,
> what should I do?
> -Gang
> ----- Original Message ----
> From: Edmund Kohlwey <ekohlwey@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: 2009/12/6 10:52:40
> Subject: Re: return in map
> Let me see if I understand:
> The mapper is reading lines in a text file. You want to see if a single
> line meets a given criterion, and emit all the lines whose index is
> greater than or equal to the single matching line's index. I'll assume
> that if more than one line meets the criteria, you have a different
> condition which you will handle appropriately.
> First, some discussion of your input: is this a single file that should
> be considered as a whole? In that case, you probably only want one
> mapper, which, depending on your reduce task, may totally invalidate the
> use case for MapReduce. You may just want to read the file directly from
> HDFS and write to HDFS in whatever application is using the data.
> Anyways, here's how I'd do it. In setup, open a temporary file (it can
> be directly on the node, or on HDFS, although directly on the node is
> preferable). Use map to perform your test, and keep a counter of how
> many lines match. After the first line matches, begin saving lines. If a
> second line matches, log the error condition or whatever. In cleanup, if
> only one line matched, open your temp file and begin emitting the lines
> you saved from earlier.
> There are a few considerations in your implementation:
> 1. File size. If the temporary file exceeds the available space on a
> mapper, you can make a temp file in HDFS but this is far from ideal.
> 2. As noted above, if there's a single mapper and no need to sort or
> reduce the output, you probably want to just implement this as a program
> that happens to be using HDFS as a data store, and not bother with
> MapReduce at all.
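
The setup/map/cleanup approach described above might look roughly like this. It is a plain-Java model, not Hadoop code: BufferingMapper, isMatch, and the "KEY" prefix test are hypothetical, and an in-memory list stands in for the temporary file, so the file-size consideration in point 1 does not show up here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "save lines after the first match, emit in cleanup"; not Hadoop code.
class BufferingMapper {
    private int matchCount = 0;                             // how many lines met the criterion
    private final List<String> saved = new ArrayList<>();   // stand-in for the temp file
    final List<String> output = new ArrayList<>();          // stand-in for emitted records

    // Hypothetical criterion; replace with the real test.
    static boolean isMatch(String line) {
        return line.startsWith("KEY");
    }

    // Stand-in for map(): count matches and save lines from the first match onward.
    void map(String line) {
        if (isMatch(line)) {
            matchCount++;
        }
        if (matchCount >= 1) {
            saved.add(line);
        }
    }

    // Stand-in for cleanup(): emit the saved lines only if exactly one line matched.
    void cleanup() {
        if (matchCount == 1) {
            output.addAll(saved);
        }
    }

    public static void main(String[] args) {
        BufferingMapper m = new BufferingMapper();
        for (String line : Arrays.asList("x", "KEY 1", "y", "z")) {
            m.map(line);
        }
        m.cleanup();
        System.out.println(m.output); // prints [KEY 1, y, z]
    }
}
```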
> On 12/6/09 10:03 AM, Sonal Goyal wrote:
>> Hi,
>> Maybe you could post your code/logic for doing this. One way would be to set
>> a flag once your criterion is met and emit keys based on the flag.
>> Thanks and Regards,
>> Sonal
>> 2009/12/5 Gang Luo <lgpublic@yahoo.com.cn>
>>> Hi all,
>>> I've got a tricky problem. I feed in a small file manually and do some filtering
>>> on each line in the map function. I check whether the line satisfies the constraint;
>>> if so I output it, otherwise I return without doing any further work.
>>> Since the map function is called on each line, I think the logic is
>>> correct. But it doesn't work like this. If there are 5 lines in a map task
>>> and only the 2nd line satisfies the constraint, then the output is lines
>>> 2, 3, 4, and 5. If the 3rd line satisfies it, the output is lines 3, 4, and 5.
>>> It seems that once a map task meets the first satisfying line, the filter
>>> stops working for the following lines.
>>> It is an interesting problem. I am checking it now. I also hope someone can
>>> give me some ideas on this. Thanks.
>>> -Gang

