lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@lucene.com>
Subject Re: Two possible solutions on Parallel Searching
Date Thu, 13 Nov 2003 17:16:38 GMT
First, note that the approaches you describe will only improve 
performance if you have multiple CPUs and/or multiple disks holding the 
indexes.

Second, MultiSearcher is currently implemented to search indexes 
serially, not each in a separate thread.  To implement multi-threaded 
searching one could subclass MultiSearcher with, e.g., ParallelSearcher, 
and override the search() methods with multi-threaded implemenations. 
This would be a great contribution if someone is interested!

The parallel approach I prefer is to maintain a set of indexes, each on 
a separate machine, then use something like a ParallelSearcher of 
RemoteSearchables to search them all.

Doug

Jie Yang wrote:
> I had a thought on my earlier post on "Poor
> Performance when searching for 500+ terms". 
> 
> The problem is on how to improve the performance when
> searching for 500+ OR search terms. i.e. enter a
> search string of :
> 
> W1 OR W2 OR W3 OR ...... OR w500.
> 
> I thought I could rewrite the MultiSearcher class so
> that it can initiate multiple parallel IndexSearchers
> to perform the search.
> 
> Solution 1 would be divide the query string of "500 OR
> conditions terms" into 25 "20 OR conditions terms",
> and then pass them to MultiSearcher, MultiSearcher
> then initiate 25 threads to search on a single index
> directory.
> 
> Solution 2 would be when building an index of 1
> million docs, instead of building one single index
> containing 1 million docs,
> build 10 index directory eaching containing 100K
> records. then I pass a single query string of "500 OR
> conditions terms" to
> MultiSearcher, MultiSearcher then initiate 10 threads
> to search for 10 different index directories. 
> 
> Has anyone tried something similar, which solution
> would be a better one. Also is using multiple threads
> on a single directory a good ideal? Are there any
> bottlenecks for threads acessing resources, or
> I better pass requests into different processes. 
> 
> Thanks a lot
> 
> 
> 
> 
>  --- Jie Yang <jyang_work@yahoo.co.uk> wrote: > Thanks
> Julian
> 
>>I am not using RAMDirectory due to the large size of
>>index file. the index generated on hard disc is
>>1.57G
>>for 1 million documents, each document has average
>>500
>>terms. I am using Field.UnStored(fieldName, terms),
>>so
>>i beliece I am not storing the documents, just the
>>index. (is that right?) is there anyway to reduce
>>the
>>index size created? also What is the maximum size of
>>data can be stored in RAMDirectory? I suppose I
>>could
>>get a 10G RAM solaris box, but would that be
>>advisable
>>say storing 2-3G of index data in memory? Also, what
>>is the performance boost factor when RAMDirectory
>>comparing to FSDirectory. Are we taling about > 100%
>>here?
>>
>>On your 2nd and 3rd suggestion, I probably run the
>>latest code that includes the fix by Dmitry
>>Serebrennikov, the build was checked out from CVS
>>yesterday. and I used a QueryParser similar to the
>>one
>>used in the demo code.
>>
>>Again, I still feel a bit curious and want to find
>>out
>>does lucene do (or in the future) pre-filter on "AND
>>join conditions". For example, A AND (B OR C OR D).
>>if
>>A finds 100 docs out of 1 million, can lucene
>>restrict
>>the searchs on B,C,D only within the 100 docs found?
>>
>>Thanks a lot.
>>
>>
>>
>> 
>>
>>
>>
>>
>>>Response to: Poor Performance when searching for
>>
>>500+
>>
>>>terms (Jie Yang) 
>>
>>>From: Julien Nioche <Julien.Nioche@lingway.com>
>>
>>Subject: Poor Performance when searching for 500+
>>terms
>>
>>>Date: Thu, 13 Nov 2003 12:45:50 +0100
>>>Content-Type: text/plain; charset="iso-8859-1"
>>>
>>>Hello,
>>>
>>>Since there are a lot of Term objects in your
>>
>>Query, 
>>
>>>your application must
>>>spend a lot of time collecting information about 
>>>those Terms.
>>>
>>>1/ Do you use RAMDirectory? Loading the whole 
>>>Directory into memory will
>>>increase speed - your index must not be too big
>>
>>though
>>
>>>2/ You are probably not using the QueryParser - so 
>>>when you are building the
>>>Query you could sort the Term objects inside a 
>>>BooleanQuery. Sorting the
>>>Terms will reduce jumps on disk. I have no
>>
>>benchmarks
>>
>>
>>>for this, but
>>>logically, it should have some positive effect when
>>
>>>using FSDirectory. Am I wrong?
>>
>>>3/ There was a patch submitted by Dmitry
>>
>>Serebrennikov
>>
>>(http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02762.html)
>>
>>>which reduced garbage collecting by limiting the 
>>>creation of temporary Term objects. This patch has 
>>>not been included in Lucene code (a bug in it?).
>>>
>>>Hope it helps.
>>>
>>>Julien
>>
>>
>>
> ________________________________________________________________________
> 
>>Want to chat instantly with your online friends? 
>>Get the FREE Yahoo!
>>Messenger http://mail.messenger.yahoo.co.uk
>> 
> 
> 
> ________________________________________________________________________
> Want to chat instantly with your online friends?  Get the FREE Yahoo!
> Messenger http://mail.messenger.yahoo.co.uk
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message