hadoop-mapreduce-user mailing list archives

From Bhaskar Ghosh <bjgin...@yahoo.co.in>
Subject Re: How to read whole files and output processed texts to another file through MapReduce
Date Fri, 19 Nov 2010 20:22:21 GMT
Hi Harsh/All,

I am getting exactly the same error as stated by Kunal Gupta
<kun...@techlead-india.com> here:


Kunal, if you are still there, please help me. Did you get the issue solved?

Exception in thread "main" java.lang.RuntimeException: 
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:923)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:820)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NoSuchMethodException: 
    at java.lang.Class.getConstructor0(Class.java:2706)
    at java.lang.Class.getDeclaredConstructor(Class.java:1985)

I have three classes:
1) PreTrainingProcessorMR is my driver program, containing the mapper and
reducer classes [see the code below for how I am running the job inside the
main method of this file]
2) WholeFileTextInputFormat is my custom InputFormat [attached]
3) WholeFileLineRecordReader is my custom RecordReader [attached]

I am executing the MapReduce program like this:

    Job job = new Job(conf, "PreTrainingProcessorMR");
    WholeFileTextInputFormat.addInputPath(job, new Path(otherArgs[0]));
    //FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
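
For context: in this Hadoop version, JobClient.writeNewSplits creates the job's
InputFormat through reflection (ReflectionUtils.newInstance), so a
NoSuchMethodException surfacing from Class.getDeclaredConstructor usually means
the configured InputFormat class has no accessible no-argument constructor,
typically because it was written as a non-static inner class. Note also that
the snippet above never calls job.setInputFormatClass(...); the static
addInputPath call alone does not select the format. A minimal sketch of a
driver and format skeleton that covers both points, reusing the class names
described above (the bodies are illustrative, not the attached code):

    // Driver: explicitly select the custom format and the job jar.
    Job job = new Job(conf, "PreTrainingProcessorMR");
    job.setJarByClass(PreTrainingProcessorMR.class);
    job.setInputFormatClass(WholeFileTextInputFormat.class);
    WholeFileTextInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

    // Format: must be a top-level (or static nested) class so that
    // reflection can find a no-argument constructor.
    // Key/value types are assumed; match them to the custom reader.
    public class WholeFileTextInputFormat extends FileInputFormat<Text, Text> {
        @Override
        public RecordReader<Text, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileLineRecordReader();  // the attached custom reader
        }
    }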

I am really stuck on this. Any idea what I am missing? Any help
would be much appreciated.

Bhaskar Ghosh
Hyderabad, India


"Ignorance is Bliss... Knowledge never brings Peace!!!"

From: Harsh J <qwertymaniac@gmail.com>
To: mapreduce-user@hadoop.apache.org
Sent: Wed, 17 November, 2010 9:40:44 AM
Subject: Re: How to read whole files and output processed texts to another file 
through MapReduce


On Wed, Nov 17, 2010 at 7:52 PM, Bhaskar Ghosh <bjgindia@yahoo.co.in> wrote:
> ---I am reading files within a directory and also subdirectories.

Currently FileInputFormat lets you read files for MapReduce, but it does
not recurse into directories. Although globs are accepted in Path
strings, for proper recursion you need to implement the logic yourself
in a custom FileInputFormat subclass (see further below).

> ---Processing one file at a time

Doable by turning off file-splitting, or by packing the inputs into SequenceFiles/HARs.

> ---Writing all the processed output to a single output file. [One output
> file per folder]

Doable with a single reducer, but why do you require a single file?

> I think I need to give one file to one Mapper at a time, when all the
> mappers combine, one single reducer should write to a single file. [as I
> think we cannot write parallely to a single output file]

There's a "getmerge" feature the Hadoop DFS utilities provide to retrieve
a DFS directory of outputs as a single file. You should use that
feature instead of bottlenecking your reduce phase with a single reducer
instance (unless it's a requirement of some sort).

See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html for
the exact command syntax.
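
For example, with a hypothetical output directory:

    hadoop fs -getmerge /user/bhaskar/job-output /tmp/merged-output.txt

This concatenates every file under the source DFS directory into one file
on the local filesystem.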

> Please suggest ways (or point me to resources) so that:
> a) my map function gets one file at a time (instead of one line at a time)

I suggest pre-creating a Hadoop SequenceFile for this purpose, with
the <Key, Value> being <Filename, Contents>. Another solution would be
to use HAR. See
http://www.cloudera.com/blog/2009/02/the-small-files-problem/ for some
further discussion on this.
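
A minimal sketch of that pre-packing step, assuming Hadoop 0.20-era
org.apache.hadoop.fs and org.apache.hadoop.io APIs, hypothetical paths, and
no recursion into subdirectories:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path seqFile = new Path("/user/bhaskar/input.seq");        // hypothetical
    SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, seqFile, Text.class, BytesWritable.class);
    try {
        for (FileStatus status : fs.listStatus(new Path("/user/bhaskar/input"))) {
            if (status.isDir()) continue;         // recurse here if needed
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
                in.readFully(0, contents);        // positioned read of whole file
            } finally {
                in.close();
            }
            writer.append(new Text(status.getPath().getName()),  // key: filename
                          new BytesWritable(contents));          // value: contents
        }
    } finally {
        writer.close();
    }

With SequenceFileInputFormat set on the job, each map call then receives one
whole file as a single <key, value> record.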

> b) Should implementing a custom RecordReader and/or FileInputFormat allow me
> to read files in subdirectories as well (one file at a time) ?

FileInputFormat.isSplitable (note the single "t" in the method name) is a
method that tells the framework whether an input file may be split into
chunks for processing, and FileInputFormat.listStatus is a method that lists
all the files (FileStatus objects) from which mapper splits are computed.

You should write a custom class extending FileInputFormat and overriding
these methods: return false so files are not split, and recurse yourself as
required to provide a proper list of FileStatus objects back to the
framework, as in the sketch below.
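
A minimal sketch of such a subclass, assuming the new
(org.apache.hadoop.mapreduce) API; the class name is illustrative:

    // Extends TextInputFormat so a RecordReader comes for free; uses
    // org.apache.hadoop.fs and org.apache.hadoop.mapreduce.lib.input.
    public class RecursiveUnsplitInputFormat extends TextInputFormat {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;  // hand each file, unsplit, to a single mapper
        }

        @Override
        protected List<FileStatus> listStatus(JobContext job) throws IOException {
            List<FileStatus> files = new ArrayList<FileStatus>();
            for (FileStatus status : super.listStatus(job)) {
                addRecursively(files, status, job.getConfiguration());
            }
            return files;
        }

        // Depth-first walk replacing every directory with its files.
        private void addRecursively(List<FileStatus> files, FileStatus status,
                                    Configuration conf) throws IOException {
            if (status.isDir()) {
                FileSystem fs = status.getPath().getFileSystem(conf);
                for (FileStatus child : fs.listStatus(status.getPath())) {
                    addRecursively(files, child, conf);
                }
            } else {
                files.add(status);
            }
        }
    }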

(In trunk code, the recursion support has been added to
FileInputFormat itself. See MAPREDUCE-1501 on Apache's JIRA for the
specifics and a patch.)

Harsh J
