hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Writing to Mapper Context from RecordReader
Date Sat, 09 Apr 2011 09:58:13 GMT
Hello Adi,

On Thu, Apr 7, 2011 at 8:12 PM, Adi <adi.pandit@gmail.com> wrote:
> using 0.21.0. I have implemented a custom InputFormat. The RecordReader
> extends org.apache.hadoop.mapreduce.RecordReader<KEYIN, VALUEIN>
> The sample I looked at threw an IOException when there was an incompatible
> input line, but I am not sure who is supposed to catch and handle this
> exception. The task just failed when the exception was thrown.
> I changed the implementation to log an error instead of throwing an
> IOException, but the best thing would be to write to the output via the
> context and report this error.
> But the RecordReader does not have a handle to the Mapper context.
> Is there a way to get a handle to the current Mapper context and write a
> message via the Mapper context from the RecordReader?
> Any other suggestions on handling bad input data when implementing a
> custom InputFormat?

I'd say logging is better, unless you also want to preserve
information on the bad records.

Anyway, to solve this, you can open a DFS file stream and write your
bad records to it. Have a look at the FAQ entry at [1] - that approach
should work from the RecordReader layer as well.
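The side-file idea can be sketched roughly as below, assuming a line-oriented custom RecordReader; the class name, the output path, and the validation hook are all hypothetical, and the actual record-parsing logic is elided:

```java
// Sketch only: a RecordReader that writes rejected input lines to a
// per-attempt side file on HDFS, per the FAQ's "write to HDFS directly
// from tasks" approach. Names and paths here are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ValidatingRecordReader extends RecordReader<LongWritable, Text> {
  private FSDataOutputStream badRecords;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Naming the side file after the task attempt ID avoids collisions
    // between speculative or retried attempts.
    Path out = new Path("/logs/bad-records/" + context.getTaskAttemptID());
    badRecords = fs.create(out);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    // Real reader logic elided. On a malformed line, instead of throwing,
    // log it to the side file and move on to the next record:
    //   badRecords.writeBytes(line + "\n");
    return false; // placeholder
  }

  @Override
  public LongWritable getCurrentKey() { return null; } // placeholder

  @Override
  public Text getCurrentValue() { return null; } // placeholder

  @Override
  public float getProgress() { return 0.0f; } // placeholder

  @Override
  public void close() throws IOException {
    if (badRecords != null) {
      badRecords.close();
    }
  }
}
```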

If you can push this validation functionality down into your mapper,
you can leverage the MultipleOutputs feature to do this easily too.
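A rough sketch of the mapper route, assuming the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs); the class name, the "bad" output name, and the isValid() check are hypothetical placeholders:

```java
// Sketch only: route records that fail validation to a separate named
// output instead of throwing. The validation itself is a placeholder.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ValidatingMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (isValid(value)) {
      context.write(new Text("ok"), value);
    } else {
      // Goes to the "bad" named output; the job driver must register it:
      //   MultipleOutputs.addNamedOutput(job, "bad",
      //       TextOutputFormat.class, Text.class, Text.class);
      mos.write("bad", new Text("invalid"), value);
    }
  }

  private boolean isValid(Text line) {
    return line.getLength() > 0; // placeholder validation
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close(); // flush the named outputs
  }
}
```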

Finally, if you can use the old API, this is possible via the
framework itself by using the 'Skip Bad Records' feature [2].
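For reference, enabling skipping in an old-API (org.apache.hadoop.mapred) job driver looks roughly like this configuration fragment; the specific values chosen are illustrative:

```java
// Config fragment for the job driver (old mapred API only): turn on
// record skipping via the SkipBadRecords helper.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

JobConf conf = new JobConf();
// Enter skip mode after 2 failed attempts on the same range.
SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
// Narrow the skipped range down to at most 1 record, so only the
// genuinely bad record is lost rather than its neighbours.
SkipBadRecords.setMapperMaxSkipRecords(conf, 1L);
```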

[1] - http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
[2] - http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Skipping+Bad+Records

Harsh J
