lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael D. Curtin" <>
Subject Re: Long Query Performance
Date Mon, 22 Jan 2007 13:53:13 GMT
Somnath Banerjee wrote:
> Thanks for the reply. Good guess I think.
> DB (Index) is basically a collection of encyclopedia documents. Queries are
> also a collection of documents but of various domains. My task is to find
> out for each "query document" top 100 matching encyclopedia contents.
> I tried by using only the title of (5-8 words) the query documents instead
> of full text of the document. But that is also taking 0.5-1 sec for each
> query. That's mean it will also take nearly 6 and half days to run
> 0.72Mqueries (and expectedly the precision will suffer).

Thank you, the problem is a little clearer now.  This is a big search 
problem, so your biggest obstacle is running it on a tiny computer (from 
what you've said, only a fraction of the *queries* fit in your RAM 
budget, not to mention the database).  I'm confident that spending a 
little $$ on your computing resources, particularly RAM, would be FAR, 
FAR more cost-effective than programming labor.

But, if you can't get a bigger computer, then you can't.  In that case, 
I'm not sure that Lucene is the best tool for this problem, at least not 
exclusively.  You might find that some preprocessing, of the 
encyclopedia and query documents, could be used to quickly find the 
highest-probability candidates, then use Lucene to score them.  I'm 
think of a merge-sort kind of algorithm between the query "documents" 
and the encyclopedia documents.  For each query, find the top several 
hundred candidate documents from the encyclopedia, perhaps by number of 
matching non-"stop words" (see below).  You could also get more 
sophisticated by looking at word frequencies from your Lucene index, but 
I doubt it would make a huge difference for queries of hundreds of words 
that are looking for hundreds of hits.

A couple more suggestions:

-   Search the archives for the topic "more documents like this" and 
variations on that theme.  Several people have used Lucene in this way, 
with varying degrees of success.

-  If you haven't already, ditch "stop words", like "a", "the", 
prepositions, etc.  Your index will be smaller and so will your queries, 
making each search faster.

Good luck!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message