hadoop-hdfs-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Question about Skip Bad Records
Date Sat, 15 Jun 2013 14:51:17 GMT

Please see comments in https://issues.apache.org/jira/browse/MAPREDUCE-1932

On Sat, Jun 15, 2013 at 12:09 PM, 小强 <790772019@qq.com> wrote:
> Hi, I found that SkippingRecordReader is no longer supported in the new API,
> and I am curious about the reason. Can anyone tell me why?
> Besides, when I looked into the old API to figure out what skip mode
> was doing, I was a little confused about the logic there.
> As I understand it, if the Java API is used, we can always locate precisely
> which record is the bad one.
> If streaming is used, as long as the user handles the counter correctly (I
> mean incrementing the counter for each record read in), we can also locate the
> exact bad record. (I wonder if I am missing something here.)
> But if the user doesn't maintain the counter, it is always very hard for the
> framework to locate bad records (even using binary search).
> To sum up:
> Ques 1: Why was skip mode removed from the new API?
> Ques 2: If the user handles the counter correctly in streaming, can we locate the
> exact bad record?
> Ques 3: In skip mode, why not locate more bad records by restarting the
> user logic, instead of locating only one bad record per task attempt?
> Thank you!
> Dasheng Jiang

Harsh J
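
For readers following Ques 2, here is a minimal sketch of what "handling the counter correctly" in a streaming mapper might look like. It assumes the old-API skip-mode counter is named MapProcessedRecords in the group SkippingTaskCounters (as I recall from org.apache.hadoop.mapred.SkipBadRecords; please verify against your Hadoop version), and it uses the streaming convention of reporting counters by writing reporter:counter:<group>,<counter>,<amount> lines to stderr. The process() function is hypothetical user logic, not something from this thread.

```python
import sys

# Assumed names: taken from the old-API SkipBadRecords class as I recall it;
# check them against the Hadoop version you are running.
COUNTER_GROUP = "SkippingTaskCounters"
MAP_PROCESSED = "MapProcessedRecords"


def process(line):
    """Hypothetical user logic: emit the key and the length of the value."""
    key, _, value = line.partition("\t")
    return f"{key}\t{len(value)}"


def run(stdin, stdout, stderr):
    """Process one record at a time, bumping the counter after each record."""
    for line in stdin:
        result = process(line.rstrip("\n"))
        stdout.write(result + "\n")
        # Report the record as processed only after the user logic has
        # finished with it, so the count tracks completed records.
        stderr.write(f"reporter:counter:{COUNTER_GROUP},{MAP_PROCESSED},1\n")
        stderr.flush()
```

To use this as an actual mapper script, call run(sys.stdin, sys.stdout, sys.stderr) at the bottom of the file. Whether a per-record counter is enough for the framework to pinpoint the exact bad record is the open question raised above; this sketch only shows the reporting side.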
