hadoop-common-user mailing list archives

From David Saile <da...@uni-koblenz.de>
Subject TeraSort bug?
Date Sun, 27 Feb 2011 22:27:16 GMT
Hi,

I have a problem concerning the TeraSort benchmark.
I am running the version that ships with hadoop-0.21.0, and if I use it as described (i.e.
TeraGen -> TeraSort -> TeraValidate), everything works fine.

However, for some tests I need to run, I added a simple job between TeraGen and TeraSort that
does nothing but copy the input. I included its code below. 

If I run this Copy job after TeraGen, TeraSort partitions the input in such a way that most
tuples go to the last reducer.
For example, if I run TeraSort with 500 MB of input and 20 reducers, I get the following distribution:
- Reducers 0-18 process ~10,000 tuples each
- Reducer 19 processes ~5,000,000 tuples
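
To put numbers on the imbalance, here is a minimal sketch that compares the approximate counts listed above with an even split across the 20 reducers. The class and method names (ReducerSkew, idealPerReducer, skewFactor) are made up for illustration and are not part of Hadoop, and the counts are the rounded figures from this message rather than exact counter values.

```java
import java.util.Arrays;

public class ReducerSkew {

    /** Records each reducer would receive under a perfectly even partition. */
    public static long idealPerReducer(long totalRecords, int reducers) {
        return totalRecords / reducers;
    }

    /** Ratio of the busiest reducer's load to the even-split load. */
    public static double skewFactor(long[] perReducer) {
        long total = Arrays.stream(perReducer).sum();
        long max = Arrays.stream(perReducer).max().orElse(0L);
        double ideal = (double) total / perReducer.length;
        return max / ideal;
    }

    public static void main(String[] args) {
        // Rounded counts from the report (assumption: not exact counter values):
        // reducers 0-18 at ~10,000 tuples each, reducer 19 at ~5,000,000.
        long[] observed = new long[20];
        Arrays.fill(observed, 10_000L);
        observed[19] = 5_000_000L;

        long total = Arrays.stream(observed).sum();
        System.out.println("total records = " + total);
        System.out.println("even share    = " + idealPerReducer(total, observed.length));
        System.out.printf("skew factor   = %.2f%n", skewFactor(observed));
    }
}
```

With these rounded figures, the last reducer handles roughly nineteen times the load it would get from an even partition, while the other reducers each get only a small fraction of their share.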

Can anyone reproduce this behavior? I would really appreciate any help!

David


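import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.examples.terasort.TeraInputFormat;
import org.apache.hadoop.examples.terasort.TeraOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Copy extends Configured implements Tool {

    public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance(new Cluster(getConf()), getConf());

        // Read the TeraGen output from args[0].
        Path inputDirOld = new Path(args[0]);
        TeraInputFormat.addInputPath(job, inputDirOld);
        job.setInputFormatClass(TeraInputFormat.class);

        // No mapper or reducer class is set, so the framework defaults are used.
        job.setJobName("Copy");
        job.setJarByClass(Copy.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Write the copied records to args[1] in TeraSort's record format.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TeraOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Copy(), args);
        System.exit(res);
    }
}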