hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: optimization help needed
Date Wed, 17 Mar 2010 13:15:55 GMT
You can control the number of reducers with JobConf.setNumReduceTasks(n). The number of mappers
is roughly (file size) / (split size). By default the split size is 64 MB, so a 500 MB file
gives about 8 splits, which matches the 8 mappers you saw. Since your dataset is not very
large, changing these settings should make no big difference.
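
As a rough illustration of that arithmetic, here is a minimal sketch in plain Python (not Hadoop code). The function name estimate_map_tasks is hypothetical, and the real split calculation in Hadoop involves more details than this, so treat the result as an estimate only.

```python
import math

MB = 1024 * 1024

def estimate_map_tasks(file_size_bytes, split_size_bytes=64 * MB):
    """Rough estimate: one map task per input split (hypothetical helper)."""
    return math.ceil(file_size_bytes / split_size_bytes)

# 500 MB / 64 MB = 7.8125, rounded up to 8 splits
print(estimate_map_tasks(500 * MB))
```

Under this estimate, getting more mappers for the same file would mean a smaller split size, which is unlikely to pay off for an input this small.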

If you are only interested in the number of blocks per email address, you don't need to send
the "original xml" as the value in the intermediate result. Emitting a small value such as a
count of 1 instead reduces the amount of data sent from mappers to reducers. Using a combiner
to pre-aggregate the data may also help.
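
To make the idea concrete, here is a small simulation in plain Python (not the Hadoop API). It assumes the address sits in an <email> element, which is a guess about your XML layout, and the names map_block, combine and reduce_counts are hypothetical. Each simulated map task emits (email, 1) instead of (email, xml), and the combiner collapses those pairs to one record per distinct address before anything is shuffled.

```python
import re
from collections import Counter

# Assumption: the address is in an <email> element; adjust to the real schema.
EMAIL_RE = re.compile(r"<email>(.*?)</email>")

def map_block(xml_block):
    """Emit (email, 1) rather than (email, xml_block)."""
    match = EMAIL_RE.search(xml_block)
    if match:
        yield match.group(1), 1

def combine(pairs):
    """Map-side pre-aggregation: one (email, count) record per distinct key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

def reduce_counts(combined_outputs):
    """Sum the partial counts coming from all map tasks."""
    totals = Counter()
    for pairs in combined_outputs:
        for key, count in pairs:
            totals[key] += count
    return dict(totals)

# Two simulated input splits, i.e. two map tasks.
split_a = [
    "<msg><email>a@example.com</email></msg>",
    "<msg><email>b@example.com</email></msg>",
    "<msg><email>a@example.com</email></msg>",
]
split_b = ["<msg><email>a@example.com</email></msg>"]

combined = [
    combine(kv for block in split for kv in map_block(block))
    for split in (split_a, split_b)
]
result = reduce_counts(combined)
print(combined)
print(result)
```

With only a handful of distinct addresses, each map task would ship just a few small records after combining, instead of one full XML block per input element.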


----- Original Message ----
From: Reik Schatz <reik.schatz@bwin.org>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: 2010/3/17 (Wed) 5:04:33 AM
Subject: optimization help needed

Preparing a Hadoop presentation here. For demonstration I start up a 5-machine m1.large cluster
in EC2 via the Cloudera scripts ($hadoop-ec2 launch-cluster my-hadoop-cluster 5). Then I send
a 500 MB XML file over into HDFS. The Mapper will receive an XML block as the key, select an
email address from the XML, and use this as the key for the reducer and the original XML as
the value. The Reducer just aggregates the number of XML blocks per email address.

Running this on the cluster takes about 2:30 min. The framework uses 8 Mappers (Spills) and
2 Reducers. The file contains about 600,000 XML elements. How can I speed up processing
time? One thing I can think of is to have more than just 2 email addresses in the sample
document to be able to use more than 2 reducers in parallel. Why did the framework choose
to use 8 mappers and not more? Maybe my sample data is too small to benefit from parallel
processing?

Thanks in advance

