hadoop-mapreduce-user mailing list archives

From <guillaume.vil...@orange-ftgroup.com>
Subject RE: How to write a custom input format and record reader to read multiple lines of text from files
Date Tue, 01 Dec 2009 09:21:38 GMT
I've developed a MultipleLineTextInputFormat for Hadoop 0.19. It is not perfect, but it
works for my needs.
I've attached the code; feel free to use or improve it. Please let me know if you
improve it.

-----Original Message-----
From: Kunal Gupta [mailto:kunal@techlead-india.com]
Sent: Tuesday, December 1, 2009 09:50
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to write a custom input format and record reader to read multiple lines of
text from files

NLineInputFormat helps split the input so that each Mapper gets N lines of text,
but it still passes a single line of text to each call to the map function.

I want N lines of text to be passed as 'value' to the Map function.

By extending the FileInputFormat and RecordReader classes, I am concatenating
N lines of text and setting that as the 'value'.
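
The grouping described above can be sketched in plain Java, independent of Hadoop. This is only an illustration of the logic a custom RecordReader would need, not the code from this thread: the class name MultiLineGrouper and the method group are hypothetical, and the sketch assumes UTF-8 input with "\n" line endings so that byte offsets can be computed from line lengths.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: groups input lines into N-line records.
class MultiLineGrouper {

    /**
     * Returns one entry per record: the key is the byte offset of the record's
     * first line, the value is up to n lines joined with '\n'.
     */
    static List<Map.Entry<Long, String>> group(String input, int n) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;       // byte offset of the line about to be read
        long recordStart = 0;  // byte offset of the current record's first line
        int linesInRecord = 0;
        StringBuilder value = new StringBuilder();

        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (linesInRecord == 0) {
                    recordStart = offset;
                } else {
                    value.append('\n');
                }
                value.append(line);
                linesInRecord++;
                // Assumption: every line ends with a single '\n' byte.
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1;

                if (linesInRecord == n) {
                    records.add(new AbstractMap.SimpleEntry<>(recordStart, value.toString()));
                    value.setLength(0);
                    linesInRecord = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Emit the final, possibly shorter, record.
        if (linesInRecord > 0) {
            records.add(new AbstractMap.SimpleEntry<>(recordStart, value.toString()));
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : group("a\nbb\nccc\ndddd\ne\n", 2)) {
            System.out.println(r.getKey() + " -> " + r.getValue().replace("\n", "\\n"));
        }
    }
}
```

In a real RecordReader the same loop would presumably live in nextKeyValue(), with the key and value exposed as LongWritable and Text; handling of split boundaries is left out of this sketch.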

But this program is not running. Probably some initialization error.

I am telling the framework to use my extended class as the InputFormat:


On Tue, 2009-12-01 at 13:53 +0530, Amogh Vasekar wrote:
> Hi,
> The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less
> the same, and should help guide you in writing a custom input format :)
> Amogh
> On 12/1/09 11:47 AM, "Kunal Gupta" <kunal@techlead-india.com> wrote:
>         Can someone explain how to override "FileInputFormat" and
>         "RecordReader" in order to read multiple lines of text from
>         input files in a single map task?
>         Here the key will be the offset of the first line of text and
>         the value will be the N lines of text.
>         I have overridden the class FileInputFormat:
>         public class MultiLineFileInputFormat
>                 extends FileInputFormat<LongWritable, Text>{
>         ...
>         }
>         and implemented the abstract method:
>         public RecordReader<LongWritable, Text> createRecordReader(
>                         InputSplit split, TaskAttemptContext context)
>                  throws IOException, InterruptedException {...}
>         I have also extended the RecordReader class:
>         public class MultiLineFileRecordReader
>                 extends RecordReader<LongWritable, Text> {...}
>         and in the job configuration, specified this new InputFormat
>         class:
>         job.setInputFormatClass(MultiLineFileInputFormat.class);
>         --------------------------------------------------------------------------
>         When I run this new map/reduce program, I get the following
>         Java error:
>         --------------------------------------------------------------------------
>         Exception in thread "main" java.lang.RuntimeException: java.lang.NoSuchMethodException: CustomRecordReader$MultiLineFileInputFormat.<init>()
>                 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>                 at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>                 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>                 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>                 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>                 at CustomRecordReader.main(CustomRecordReader.java:257)
>         Caused by: java.lang.NoSuchMethodException: CustomRecordReader$MultiLineFileInputFormat.<init>()
>                 at java.lang.Class.getConstructor0(Class.java:2706)
>                 at java.lang.Class.getDeclaredConstructor(Class.java:1985)
>                 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>                 ... 5 more
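
The trace shows the framework looking up a no-argument constructor by reflection (Class.getDeclaredConstructor) on a class named CustomRecordReader$MultiLineFileInputFormat, that is, a class nested inside CustomRecordReader. One common cause of this exception is that the nested class is a non-static inner class: its constructor implicitly takes the enclosing instance as a parameter, so no zero-argument constructor exists. The original source is not shown, so this is an assumption rather than a confirmed diagnosis. A minimal sketch of the effect, with hypothetical class names:

```java
// Hypothetical demo, not the poster's code: shows why a reflective lookup of a
// no-argument constructor fails for a non-static inner class.
class NestedCtorDemo {

    // Non-static inner class: the compiler gives its constructor an implicit
    // parameter for the enclosing NestedCtorDemo instance.
    class InnerFormat { }

    // Static nested class: has a genuine no-argument constructor.
    static class StaticFormat { }

    /** Returns true if cls declares a constructor that takes no arguments. */
    static boolean hasNoArgConstructor(Class<?> cls) {
        try {
            cls.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("inner:  " + hasNoArgConstructor(InnerFormat.class));
        System.out.println("static: " + hasNoArgConstructor(StaticFormat.class));
    }
}
```

If this is the cause, declaring the nested input format (and the nested record reader, if any) as static, or moving it to its own top-level file, should let the framework instantiate it. A class that declares only constructors with arguments would produce the same exception.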

