hadoop-common-user mailing list archives

From Enis Soztutar <enis....@gmail.com>
Subject Re: how to optimize mapreduce procedure??
Date Fri, 13 Mar 2009 14:58:17 GMT
ZhiHong Fu wrote:
> Hello,
> I'm writing a program that runs Lucene searches over about 12 index
> directories, all of which are stored in HDFS. It works like this:
> 1. We build about 12 index directories with Lucene's indexing
> functionality, each about 100 MB in size.
> 2. We store these 12 index directories in Hadoop HDFS. The Hadoop
> cluster consists of one namenode and five datanodes, 6 computers
> in total.
> 3. I then run Lucene searches over these 12 index directories.
> The MapReduce steps are as follows:
>     Map procedure: the 12 index directories are split across
> numOfMapTasks map tasks. For example, if numOfMapTasks=3, each map
> gets 4 indexDirs and stores them in an intermediate result.
>     Combine procedure: for each local intermediate result, we run the
> actual Lucene search on the index directories it contains, and then
> store the hits in the intermediate result.
>     Reduce procedure: merge the hits from the intermediate results
> to get the search result.
> But with this implementation I have a performance problem. Whatever
> values I set for numOfMapTasks and numOfReduceTasks, for example
> numOfMapTasks=12, numOfReduceTasks=5, a simple search takes about
> 28 seconds, which is obviously unacceptable.
> So I'm not sure whether my map-reduce procedure is wrong or whether I
> set the wrong number of map or reduce tasks, and in general where the
> overhead of a MapReduce procedure comes from. Any suggestions will be
> appreciated.
> Thanks.
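For reference, the Map step described above amounts to partitioning the index directories among the map tasks. A minimal Python sketch of that split, assuming round-robin assignment (the post does not say how directories are assigned); the function name and directory names are hypothetical:

```python
def split_index_dirs(index_dirs, num_map_tasks):
    """Assign index directories to map tasks round-robin (an assumed policy)."""
    groups = [[] for _ in range(num_map_tasks)]
    for i, index_dir in enumerate(index_dirs):
        groups[i % num_map_tasks].append(index_dir)
    return groups


# Hypothetical directory names standing in for the 12 index directories.
index_dirs = [f"index-{i:02d}" for i in range(12)]
groups = split_index_dirs(index_dirs, 3)

for task_id, group in enumerate(groups):
    print(task_id, group)
```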
Keeping the indexes in HDFS is not the best choice. Moreover, MapReduce is
not a good fit for distributed search over several nodes: the overhead of
starting a new job for every search is not acceptable.
You can use Nutch distributed search or Katta (not sure about the name)
for this.
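
To illustrate the general idea behind that kind of distributed search, here is a rough Python sketch, not Nutch or Katta code. It assumes each node keeps a long-lived searcher open over its own local index shard, so a query only pays for a parallel fan-out and a merge of the top hits instead of a job startup. ShardSearcher, search_all, the toy term-count scoring and the sample documents are all made up for the example; a real setup would wrap Lucene searchers behind some RPC layer.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class ShardSearcher:
    """Hypothetical stand-in for a searcher kept open over one local index shard."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> text, loaded once and reused for every query

    def search(self, term, k):
        """Return up to k (score, doc_id) hits; score is a toy term count."""
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)


def search_all(searchers, pool, term, k=10):
    """Scatter the query to every shard in parallel, then merge the top k hits."""
    per_shard = pool.map(lambda s: s.search(term, k), searchers)
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))


# Searchers and the thread pool are created once and shared by all queries.
searchers = [
    ShardSearcher("shard-0", {"a1": "hadoop stores blocks in hdfs",
                              "a2": "lucene builds an index"}),
    ShardSearcher("shard-1", {"b1": "lucene search over a lucene index",
                              "b2": "map and reduce"}),
]
pool = ThreadPoolExecutor(max_workers=len(searchers))

results = search_all(searchers, pool, "lucene", k=2)
print(results)

pool.shutdown()
```

The point of the sketch is only the shape: the per-query work is a fan-out plus a top-k merge, and nothing is opened or scheduled per search.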
