lucene-dev mailing list archives

From "Robert Engels" <>
Subject RE: Lucene Optimized Query Broken?
Date Wed, 07 Jan 2004 17:12:49 GMT
I agree, sort of. I would think that even standard Lucene usage would
benefit greatly from this, especially for terms that occur very frequently.
Think of the search +epson +printer: why search the entire printer index in
this case, especially if the printer index can be randomly accessed,
provided it is sorted by document number?

Now that I've dug deeper into the code, I think the Lucene structure is very
close to supporting this. Once the outer query is flattened, just sort the
sub boolean term queries by frequency, then when evaluating the hits, just
pass the required hits as a set into TermDocs?
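
To make the idea concrete, here is a rough sketch in plain Java. It does not
use any Lucene classes: the class name FrequencyOrderedAnd and the sorted
int[] arrays standing in for each term's document numbers are assumptions
made only for illustration. It orders the terms by frequency, then checks
the surviving candidates against each more frequent term by random access
(a binary search) instead of scanning the whole list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; these are not Lucene classes.
class FrequencyOrderedAnd {

    /** Intersects sorted document-number arrays, starting with the rarest term. */
    static int[] intersect(List<int[]> postings) {
        if (postings.isEmpty()) return new int[0];

        // Sort the required terms by document frequency (array length).
        List<int[]> byFreq = new ArrayList<>(postings);
        byFreq.sort(Comparator.comparingInt((int[] p) -> p.length));

        // The rarest term supplies the initial candidate set.
        int[] candidates = byFreq.get(0);
        for (int i = 1; i < byFreq.size() && candidates.length > 0; i++) {
            int[] docs = byFreq.get(i);
            int[] kept = new int[candidates.length];
            int n = 0;
            for (int doc : candidates) {
                // Random access into the larger list; no full scan needed.
                if (Arrays.binarySearch(docs, doc) >= 0) kept[n++] = doc;
            }
            candidates = Arrays.copyOf(kept, n);
        }
        return candidates;
    }

    public static void main(String[] args) {
        int[] epson = {3, 17, 42};
        int[] printer = {1, 2, 3, 5, 8, 13, 17, 21, 34, 55};
        // prints [3, 17]
        System.out.println(Arrays.toString(intersect(Arrays.asList(printer, epson))));
    }
}
```

With k candidates and a frequent term of n documents, this does roughly
k * log(n) comparisons rather than reading all n entries.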

I do have a Lucene IndexReader & IndexWriter implementation that uses a
relational datastore, and it is extremely fast, in many ways much faster
than Lucene's file system based indexing, especially for indexes that change

This is the last holdup on having a truly lightning-fast search system in
the relational store.

It sounds like your proposed changes will work. If you need any assistance
with debugging, etc., please let me know.


-----Original Message-----
From: Doug Cutting []
Sent: Wednesday, January 07, 2004 10:56 AM
To: Lucene Developers List
Subject: Re: Lucene Optimized Query Broken?

Robert Engels wrote:
> I have an index with documents that have only 2 fields, the first (unique) is
> 'very unique', in that most documents have at least somewhat varying terms,
> the second is a boolean that contains only (boolean) 'true' or 'false'.
> The index contains 100,000,000+ documents.
> If I perform the following search '+unique:somevalue +boolean:true', it
> will search on the first term, returning very few documents, but then it
> will search the second term, returning possibly a million+ documents, then
> it will intersect the lists, returning 'hits' of only a few documents.

First, this is not the sort of query that Lucene is designed to
efficiently handle.  Rather, this is the sort of thing that a relational
database is designed for.  Lucene is primarily designed to support text
searching, where field values are natural language text and query terms
are words describing a user's interest.  You can implement full text
search with a relational database, but it will be slow.  Similarly, you
can search tabular data with Lucene, but it may be slow.

That said, I'm currently working on an optimization that will make such
queries substantially faster in Lucene.  The heart of it is to add data
to the index so that TermDocs.skipTo() is much faster.  Then the search
algorithms are modified to call TermDocs.skipTo().  This should make
conjunctive queries (ANDs and phrases) significantly faster when one
term occurs much less frequently than others.  I hope to check this in
in the next week or so.
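
As a rough sketch of why skipping helps a conjunction, consider the
following plain Java. It is not the change described above: Postings is a
hypothetical in-memory stand-in that borrows the next()/doc()/skipTo()
method names from TermDocs, and the binary search inside skipTo() merely
plays the role of the skip data that would be stored in the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
class SkipToAnd {

    /** In-memory cursor over sorted, distinct document numbers. */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int... docs) { this.docs = docs; }

        /** Moves to the next entry; false when exhausted. */
        boolean next() { return ++pos < docs.length; }

        /** Current document number; valid only after next()/skipTo() returned true. */
        int doc() { return docs[pos]; }

        /** Moves to the first later entry >= target; false when exhausted. */
        boolean skipTo(int target) {
            int lo = pos + 1, hi = docs.length;
            while (lo < hi) {                       // binary search stands in for skip data
                int mid = (lo + hi) >>> 1;
                if (docs[mid] < target) lo = mid + 1; else hi = mid;
            }
            pos = lo;
            return pos < docs.length;
        }
    }

    /** ANDs two terms by letting whichever cursor is behind skip to the other. */
    static List<Integer> and(Postings rare, Postings common) {
        List<Integer> hits = new ArrayList<>();
        if (!rare.next() || !common.next()) return hits;
        while (true) {
            int r = rare.doc(), c = common.doc();
            if (r == c) {
                hits.add(r);
                if (!rare.next() || !common.next()) break;
            } else if (r < c) {
                if (!rare.skipTo(c)) break;
            } else {
                if (!common.skipTo(r)) break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Postings unique = new Postings(3, 17, 42);
        Postings bool = new Postings(1, 2, 3, 5, 8, 13, 17, 21, 34, 55);
        // prints [3, 17]
        System.out.println(and(unique, bool));
    }
}
```

The frequent term's list is only touched near the few documents that the
rare term matches, so the cost tracks the rare term rather than the
frequent one.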


To unsubscribe, e-mail:
For additional commands, e-mail:
