hadoop-mapreduce-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: Skipping Bad Records
Date Thu, 13 Oct 2011 21:31:06 GMT
Justin,

The skipping feature should really only be used when you are calling
out to a third-party library that may segfault on corrupt data, and
even then it's probably better to use a subprocess to handle it, as
Owen suggested here:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+qHS3V3bbA@mail.gmail.com%3e.

In other cases you should handle the corrupt data in your mapper or
reducer, for example by catching the relevant exception.
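
A minimal sketch of that catch-and-continue pattern, written in plain
Java with no Hadoop dependencies so it runs on its own. The class name,
the tab-separated record format, and the helper methods are all
hypothetical and only illustrate the idea. In a real job the try/catch
would sit inside your map() or reduce() method, and the corrupt-record
count would more likely be a job counter, e.g.
context.getCounter(...).increment(1) in the new API.

```java
import java.util.Arrays;
import java.util.List;

public class SkipCorruptRecords {

    // Hypothetical parser: expects "key<TAB>integer" and throws on anything else.
    static long parseValue(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields, got " + fields.length);
        }
        // Long.parseLong throws NumberFormatException on non-numeric input.
        return Long.parseLong(fields[1].trim());
    }

    // Stands in for the body of map(): process each record, and on a
    // parse failure count the record as corrupt and move on instead of
    // letting the exception fail the whole task.
    // Returns {sum of good values, number of corrupt records}.
    static long[] sumSkippingCorrupt(List<String> lines) {
        long sum = 0;
        long corrupt = 0;
        for (String line : lines) {
            try {
                sum += parseValue(line);
            } catch (IllegalArgumentException e) {
                // NumberFormatException is a subclass of IllegalArgumentException,
                // so both failure modes of parseValue land here.
                corrupt++;
            }
        }
        return new long[] {sum, corrupt};
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a\t1", "garbage", "b\tx", "c\t2");
        long[] result = sumSkippingCorrupt(input);
        System.out.println("sum=" + result[0] + " corrupt=" + result[1]);
    }
}
```

Catching only the exception types the parser is known to throw keeps
genuine bugs (a NullPointerException, say) visible, and counting the
skipped records lets you see how much data was dropped.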

Tom

On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <justin.woody@gmail.com> wrote:
> Harsh,
>
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
>
> Justin
>
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <harsh@cloudera.com> wrote:
>> Justin,
>>
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>>
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>>
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <justin.woody@gmail.com> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>>>
>>> Thanks,
>>> Justin
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
