hadoop-common-user mailing list archives

From "Gross, Danny" <Danny.Gr...@spansion.com>
Subject Teragen defaults to 2 maps; terasort defaults to 1 reducer
Date Mon, 29 Jun 2009 21:03:51 GMT
Hello all,


I'm trying to run the hadoop-0.19.1-examples.jar teragen and terasort
programs on a cluster.  I have two problems with these programs:


1.	The generated data is not balanced across my cluster, because
teragen writes it with only 2 maps.

	*	With the command "hadoop jar hadoop-0.19.1-examples.jar
teragen 1000000000 /terasort"  (or any size) per the example doc, I get
2 maps.  With replication set to 2, this tends to place data more
heavily on 2 of my nodes, and the cluster believes it is balanced.


2.	The terasort program runs out of disk space during the reduce
phase, because it runs with a single reduce task.

	*	When running "hadoop jar hadoop-0.19.1-examples.jar
terasort /terasort /out" per the example doc, I get the appropriate
number of maps, but only one reduce task.  I've scoured the web and the
new Hadoop book, and I have not been able to change the number of
reducers.  One attempt was with the command "hadoop jar
-Dmapred.reduce.tasks=16 hadoop-0.19.1-examples.jar terasort /terasort


Could anyone help shed some light on how to modify the execution of
these programs to more appropriately balance the data, and spread the
reduce load out across my cluster?  
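
One thing that may be worth checking is where the -D option sits on the command line. The sketch below assumes that teragen and terasort parse generic options themselves, so the -D goes after the program name rather than directly after "hadoop jar", and that teragen treats mapred.map.tasks as the number of maps to run. Neither assumption is confirmed above. The map count of 16 is an illustrative value, and the commands are only printed, not executed.

```shell
# Dry-run sketch: build the two commands and print them instead of running them.
JAR=hadoop-0.19.1-examples.jar
MAPS=16      # illustrative value, not taken from the message
REDUCES=16   # same value as the attempt quoted above

# Assumption: generic options such as -D are handled by the example program,
# so they are placed after the program name (teragen / terasort).
TERAGEN_CMD="hadoop jar $JAR teragen -Dmapred.map.tasks=$MAPS 1000000000 /terasort"
TERASORT_CMD="hadoop jar $JAR terasort -Dmapred.reduce.tasks=$REDUCES /terasort /out"

echo "$TERAGEN_CMD"
echo "$TERASORT_CMD"
```

If the printed commands look right for the cluster, they can be pasted into a shell and run as they are.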


Best regards,


Danny Gross

