hadoop-common-user mailing list archives

From Ning Li <ning.li...@gmail.com>
Subject Re: how to optimize mapreduce procedure??
Date Fri, 13 Mar 2009 18:59:28 GMT
I would agree with Enis. MapReduce is good for building large indexes in
batch, but not for search, which requires real-time response.


On Fri, Mar 13, 2009 at 10:58 AM, Enis Soztutar <enis.soz@gmail.com> wrote:
> ZhiHong Fu wrote:
>> Hello,
>>           I'm writing a program which performs Lucene searches over
>> about 12 index directories, all of them stored in HDFS. It works
>> like this:
>> 1. We build about 12 index directories through Lucene's indexing
>> functionality, each of them about 100 MB in size.
>> 2. We store these 12 index directories on Hadoop HDFS; this Hadoop
>> cluster is made up of one namenode and five datanodes, six
>> computers in total.
>> 3. We then run Lucene searches over these 12 index directories.
>> The MapReduce steps are as follows:
>>    Map: the 12 index directories are split among numOfMapTasks; for
>> example, if numOfMapTasks=3, then each map task gets 4 index
>> directories and records them in an intermediate result.
>>    Combine: for each intermediate result, locally, we run the actual
>> Lucene search over its index directories and store the hits in the
>> intermediate result.
>>    Reduce: merge the intermediate results' hits into the final
>> search result.
>> But with this implementation I have a performance problem. Whatever
>> values I set numOfMapTasks and numOfReduceTasks to, e.g.
>> numOfMapTasks=12, numOfReduceTasks=5, a simple search takes about
>> 28 seconds, which is obviously unacceptable.
>> So I'm not sure whether my map-reduce procedure is wrong or my
>> numbers of map and reduce tasks are wrong, and, in general, where
>> the overhead of the MapReduce procedure comes from. Any suggestion
>> will be appreciated.
>> Thanks.
> Keeping the indexes on HDFS is not the best choice. Moreover, MapReduce
> does not fit the problem of distributed search over several nodes: the
> overhead of starting a new job for every search is not acceptable.
> You can use Nutch distributed search or Katta (not sure about the name)
> for this.
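For reference, the map/combine/reduce flow the question describes can be
sketched as plain Java, stripped of the Hadoop and Lucene APIs. All names
here (SearchSketch, splitDirs, searchSplit, mergeHits) are hypothetical,
and the "search" is a stand-in that just tags each directory with the
query; a real job would use Hadoop's Mapper/Reducer and Lucene's
IndexSearcher:

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free sketch of the flow described in the question above.
public class SearchSketch {

    // Map step: partition the index directories evenly across map tasks,
    // e.g. 12 directories and numMapTasks=3 gives 4 directories per task.
    static List<List<String>> splitDirs(List<String> indexDirs, int numMapTasks) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numMapTasks; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < indexDirs.size(); i++) {
            splits.get(i % numMapTasks).add(indexDirs.get(i));
        }
        return splits;
    }

    // Combine step: run the "search" locally over one task's directories.
    // Stand-in only: each hit is just "directory:query".
    static List<String> searchSplit(List<String> dirs, String query) {
        List<String> hits = new ArrayList<>();
        for (String dir : dirs) {
            hits.add(dir + ":" + query);
        }
        return hits;
    }

    // Reduce step: merge the per-task hit lists into one result set.
    static List<String> mergeHits(List<List<String>> partialHits) {
        List<String> all = new ArrayList<>();
        for (List<String> hits : partialHits) {
            all.addAll(hits);
        }
        return all;
    }
}
```

Note that the per-query cost the thread complains about lies outside this
logic entirely: it is the job-submission and task-startup overhead that
MapReduce pays on every run, which is why the answers recommend a
dedicated distributed-search system instead.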
