hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: why does my mapper class reads my input file twice?
Date Tue, 06 Mar 2012 07:06:48 GMT
It's your use of the mapred.input.dir property, which is a reserved
name in the framework (it's what FileInputFormat uses).

You have a config property you extract the path from:
Path input = new Path(conf.get("mapred.input.dir"));

Then you do:
FileInputFormat.addInputPath(job, input);

Which, internally, simply appends the path to a config property called
"mapred.input.dir". Hence your job gets launched with two input paths
(both pointing to the very same file) - one added by the Tool-provided
configuration (GenericOptionsParser picking up your -Dmapred.input.dir)
and the other added by you.
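
To make this concrete, FileInputFormat.addInputPath boils down to roughly
the following (a paraphrased sketch of the 0.20-era code, not the verbatim
source):

// Inside org.apache.hadoop.mapreduce.lib.input.FileInputFormat (paraphrased):
public static void addInputPath(Job job, Path path) throws IOException {
  Configuration conf = job.getConfiguration();
  // Qualify the path against its FileSystem and escape it for storage.
  path = path.getFileSystem(conf).makeQualified(path);
  String dirStr = StringUtils.escapeString(path.toString());
  String dirs = conf.get("mapred.input.dir");
  // If the property is already set (e.g. via -Dmapred.input.dir on the
  // command line), the new path is appended after a comma -- which is
  // exactly how you end up with the same file listed twice.
  conf.set("mapred.input.dir", dirs == null ? dirStr : dirs + "," + dirStr);
}

So every path you add just gets comma-appended to whatever
"mapred.input.dir" already holds.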

Fix the input path line to use a different config property:
Path input = new Path(conf.get("input.path"));

And run the job as:
hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
-Dmapred.output.dir=result
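
Putting it together, a minimal sketch of the corrected run() method (same
structure as yours, just reading the input path from the non-reserved
"input.path" property):

@Override
public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  // "input.path" is our own property name; nothing in the framework writes
  // to it, so addInputPath() below is the only thing that populates
  // "mapred.input.dir".
  Path input = new Path(conf.get("input.path"));
  Path output = new Path(conf.get("mapred.output.dir"));

  Job job = new Job(conf, "dummy job");
  job.setJarByClass(MyJob.class);
  job.setMapperClass(MyMapper.class);
  job.setReducerClass(MyReducer.class);
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);

  FileInputFormat.addInputPath(job, input);    // single input path now
  FileOutputFormat.setOutputPath(job, output);

  return job.waitForCompletion(true) ? 0 : 1;
}

With that change each input line should show up exactly once.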

On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne <jane.wayne2978@gmail.com> wrote:
> I have code that reads in a text file. I notice that each line in the text
> file is somehow being read twice. Why is this happening?
>
> My mapper class looks like the following:
>
> public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
>
>   private static final Log _log = LogFactory.getLog(MyMapper.class);
>
>   @Override
>   public void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>     String s = (new StringBuilder()).append(value.toString()).append("m").toString();
>     context.write(key, new Text(s));
>     _log.debug(key.toString() + " => " + s);
>   }
> }
>
> My reducer class looks like the following:
>
> public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
>
>   private static final Log _log = LogFactory.getLog(MyReducer.class);
>
>   @Override
>   public void reduce(LongWritable key, Iterable<Text> values, Context context)
>       throws IOException, InterruptedException {
>     for (Iterator<Text> it = values.iterator(); it.hasNext();) {
>       Text txt = it.next();
>       String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
>       context.write(key, new Text(s));
>       _log.debug(key.toString() + " => " + s);
>     }
>   }
> }
>
> My job class looks like the following:
>
> public class MyJob extends Configured implements Tool {
>
>   public static void main(String[] args) throws Exception {
>     ToolRunner.run(new Configuration(), new MyJob(), args);
>   }
>
>   @Override
>   public int run(String[] args) throws Exception {
>     Configuration conf = getConf();
>     Path input = new Path(conf.get("mapred.input.dir"));
>     Path output = new Path(conf.get("mapred.output.dir"));
>
>     Job job = new Job(conf, "dummy job");
>     job.setMapOutputKeyClass(LongWritable.class);
>     job.setMapOutputValueClass(Text.class);
>     job.setOutputKeyClass(LongWritable.class);
>     job.setOutputValueClass(Text.class);
>
>     job.setMapperClass(MyMapper.class);
>     job.setReducerClass(MyReducer.class);
>
>     FileInputFormat.addInputPath(job, input);
>     FileOutputFormat.setOutputPath(job, output);
>
>     job.setJarByClass(MyJob.class);
>
>     return job.waitForCompletion(true) ? 0 : 1;
>   }
> }
>
> The text file that I am trying to read in looks like the following. As you
> can see, there are 9 lines.
>
> T, T
> T, T
> T, T
> F, F
> F, F
> F, F
> F, F
> T, F
> F, T
>
> The output file that I get after my Job runs looks like the following. As
> you can see, there are 18 lines. Each key is emitted twice from the mapper
> to the reducer.
>
> 0   T, Tmr
> 0   T, Tmr
> 6   T, Tmr
> 6   T, Tmr
> 12  T, Tmr
> 12  T, Tmr
> 18  F, Fmr
> 18  F, Fmr
> 24  F, Fmr
> 24  F, Fmr
> 30  F, Fmr
> 30  F, Fmr
> 36  F, Fmr
> 36  F, Fmr
> 42  T, Fmr
> 42  T, Fmr
> 48  F, Tmr
> 48  F, Tmr
>
> The way I execute my Job is as follows (Cygwin + Hadoop 0.20.2).
>
> hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt
> -Dmapred.output.dir=result
>
> Originally, this happened when I read in a sequence file, but even for a
> text file, this problem is still happening. Is it the way I have set up my
> Job?



-- 
Harsh J
