hadoop-common-user mailing list archives

From Reik Schatz <reik.sch...@bwin.org>
Subject optimization help needed
Date Wed, 17 Mar 2010 09:04:33 GMT
Preparing a Hadoop presentation here. For demonstration I start up a 5-machine
m1.large cluster in EC2 via the Cloudera scripts ($hadoop-ec2
launch-cluster my-hadoop-cluster 5). Then I copy a 500 MB XML file
into HDFS. The Mapper receives an XML block as its input, selects an
email address from the XML, and emits that address as the key with the
original XML block as the value. The Reducer then aggregates the number
of XML blocks per email address.
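The job described above boils down to a map step that keys each XML block by an extracted email address and a reduce step that counts blocks per key. A minimal pure-Python sketch of that logic, outside Hadoop (the sample records, the <email> element name, and the extraction regex are all assumptions about the real data):

```python
import re
from collections import defaultdict

# Sample records standing in for the XML blocks split out of the input file
# (the <email> element name is an assumption about the real schema).
records = [
    "<record><email>alice@example.com</email><data>1</data></record>",
    "<record><email>bob@example.com</email><data>2</data></record>",
    "<record><email>alice@example.com</email><data>3</data></record>",
]

def map_phase(xml_block):
    """Emit (email, xml_block), mirroring the Mapper described above."""
    match = re.search(r"<email>([^<]+)</email>", xml_block)
    if match:
        yield match.group(1), xml_block

def reduce_phase(key, values):
    """Count the XML blocks seen for one email address."""
    return key, sum(1 for _ in values)

# Simulate the shuffle: group mapper output by key before reducing.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(counts)  # {'alice@example.com': 2, 'bob@example.com': 1}
```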

Running this on the cluster takes about 2:30 min. The framework uses 8
Mappers (spills) and 2 Reducers. The file contains about 600,000 XML
elements. How can I speed up processing time? One thing I can think of
is to have more than just 2 email addresses in the sample document, so
that more than 2 Reducers can run in parallel. Why did the framework
choose 8 Mappers and not more? Maybe my sample data is too small to
benefit from parallel processing.
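On the mapper count: Hadoop normally creates one map task per input split, and splits default to one per HDFS block, so the arithmetic below would explain the 8 Mappers (the 64 MB default block size is an assumption about this cluster's configuration):

```python
import math

file_size_mb = 500   # size of the XML input file
block_size_mb = 64   # HDFS default block size (assumed for this cluster)

# One map task per input split, one split per HDFS block by default.
num_splits = math.ceil(file_size_mb / block_size_mb)
print(num_splits)  # 8
```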

Thanks in advance
