lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Long Query Performance
Date Mon, 22 Jan 2007 18:24:29 GMT
Somnath,

In addition to everything said, you could try and use:
- BooleanQuery.setUseScorer14(true), which works for long OR like
  queries as yours,
- a search method that returns a TopDocs instead of a Hits, and with that
- a FieldCache to retrieve the (primary key) values of the documents.

Regards,
Paul Elschot

P.S. Instead of setUseScorer14(true) you might try this patch, which  should be just as quick:
 http://issues.apache.org/jira/browse/LUCENE-730 .


On Monday 22 January 2007 15:36, mark harwood wrote:
> "MoreLikeThis.java" is in the "contrib" section of SVN and this will help optimise your
queries to searching for only the most discriminating terms.
> On a large index, very common terms can really kill performance (reading lots of docIds
from disk) and the MoreLikeThis class will help to trim your query terms down to avoid such
words.
> 
> This will still take some time to run but at least the job sounds like one that can easily
be split to run in parallel on multiple machines (assuming you have some to hand!)
> 
> Cheers
> Mark
> 
> 
> ----- Original Message ----
> From: Somnath Banerjee <somnath.banerjee@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 22 January, 2007 1:28:20 PM
> Subject: Re: Long Query Performance
> 
> Thanks for the reply. Good guess I think.
> 
> DB (Index) is basically a collection of encyclopedia documents. Queries are
> also a collection of documents but of various domains. My task is to find
> out for each "query document" top 100 matching encyclopedia contents.
> 
> I tried by using only the title of (5-8 words) the query documents instead
> of full text of the document. But that is also taking 0.5-1 sec for each
> query. That's mean it will also take nearly 6 and half days to run
> 0.72Mqueries (and expectedly the precision will suffer).
> 
> Thanks,
> Somnath
> 
> On 1/22/07, Michael D. Curtin <mike@curtin.com> wrote:
> >
> > Somnath Banerjee wrote:
> >
> > >            I have created a 8GB index of almost 2 million documents. My
> > > requirement is to run nearly 0.72 million query on this index. Each
> > query
> > > consists of 200 - 400 words. I have created a Boolean Query by ORing
> > these
> > > words. But each query is taking nearly 5 - 10 seconds to execute ( 2.78
> > > GHz,
> > > 1.5 GB RAM). That's mean the entire batch of 0.72M query will take more
> > > than
> > > 70 days to execute. Is it expected or there is a way to improve the
> > > performance? From earlier posts I gathered that complex query is
> > > expected to
> > > take more time (this much???).
> >
> > A back of the envelope calculation:
> >      8GB / 2M docs = 4KB per doc, on avg
> >                    / 5 B per word, on avg = 800 words per doc, on avg
> >
> > So, each query is a quarter to half the size of the average document.  I
> > suspect that just about every query is hitting almost every document in
> > the db, i.e. the queries are not very selective at all.  That's going to
> > be slow, no two ways about it.
> >
> > Could you tell us a bit more about the db and what your application is
> > looking for in it, at a higher level of abstraction?
> >
> > --MDC
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> 
> 
> 
> 	
> 	
> 		
> ___________________________________________________________ 
> New Yahoo! Mail is the ultimate force in competitive emailing. Find out more at the Yahoo!
Mail Championships. Plus: play games and win prizes. 
> http://uk.rd.yahoo.com/evt=44106/*http://mail.yahoo.net/uk 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 

Mime
  • Unnamed multipart/alternative (inline, 7-Bit, 0 bytes)
View raw message