Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Reply-To: <rengels@ix.netcom.com>
From: "Robert Engels" <rengels@ix.netcom.com>
To: <lucene-dev@jakarta.apache.org>
Subject: Lucene Optimized Query Broken?
Date: Tue, 6 Jan 2004 15:04:11 -0600
Message-ID: <LMENLAOACIBLMOIILNNNKEIIDLAA.rengels@ix.netcom.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_NextPart_000_004F_01C3D466.573CF030"
Importance: Normal

------=_NextPart_000_004F_01C3D466.573CF030
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

I have implemented an IndexReader that uses a relational datastore, and in
performing queries (and reviewing the Lucene code), I see the following
behavior with Lucene.

It has no way of limiting further searches based on the 'hits' for the more
unique terms. UNLESS I AM MISSING SOMETHING...

Say for example:

I have an index with documents that have only 2 fields. The first (unique) is
'very unique', in that most documents have at least somewhat varying terms;
the second (boolean) contains only 'true' or 'false'. The
index contains 100,000,000+ documents.

If I perform the following search, "+unique:somevalue +boolean:true", Lucene
will search on the first term, returning very few documents, but then it
will search the second term, returning possibly a million+ documents, and then
it will intersect the lists, returning 'hits' of only a few documents.

Shouldn't Lucene look at the 'term frequency', build the query in order of
'uniqueness', and then have some method of restricting further 'term'
searches to only certain sets of documents? The only 'IndexReader' interface
based support is TermEnum and TermDocs, but neither of these can take a
'document id set restriction'.
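
To make the suggestion concrete, here is a minimal sketch of the idea. It
does not use Lucene's API: posting lists are modeled as plain sorted int
arrays of document ids, and the names OrderedIntersection, advance and
intersect are hypothetical, chosen only for this illustration. The lists are
ordered by length (standing in for 'term frequency'), the rarest list drives
the search, and every longer list is only probed at the candidate document
ids instead of being read in full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: sorted int[] arrays stand in for posting lists.
class OrderedIntersection {

    // Binary search for the first position >= from whose doc id is >= target.
    // Returns postings.length if no such position exists.
    static int advance(int[] postings, int from, int target) {
        int lo = from, hi = postings.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (postings[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Intersects sorted posting lists, letting the shortest (rarest) list
    // drive so that longer lists are probed rather than scanned.
    static List<Integer> intersect(int[][] postingLists) {
        List<Integer> hits = new ArrayList<>();
        if (postingLists.length == 0) {
            return hits;
        }
        // Order by 'uniqueness': fewest documents first.
        int[][] lists = postingLists.clone();
        Arrays.sort(lists, Comparator.comparingInt(l -> l.length));

        int[] pos = new int[lists.length]; // current cursor in each list
        outer:
        for (int doc : lists[0]) {
            for (int i = 1; i < lists.length; i++) {
                pos[i] = advance(lists[i], pos[i], doc);
                if (pos[i] == lists[i].length) {
                    break outer;    // a required list is exhausted
                }
                if (lists[i][pos[i]] != doc) {
                    continue outer; // candidate missing from a required list
                }
            }
            hits.add(doc);
        }
        return hits;
    }
}
```

Calling OrderedIntersection.intersect with a three-entry list and a
thousand-entry list performs three probes into the long list rather than
reading all thousand entries. Whether a real index can support such probes
cheaply depends on how its postings are stored.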

THE SAME PROBLEM OCCURS WITH ONLY A SINGLE FIELD AS WELL. Using the same
example as above, a search like

"+unique:someuniquevalue +unique:someverynonuniquevalue"

will still cause Lucene to read all of the index information for
'someverynonuniquevalue', rather than restrict the search to only those
documents returned for 'someuniquevalue'.

All types of queries should be reordered to restrict further searches, based
on the matches/non-matches in REQUIRED/PROHIBITED term clauses.

This overhead may not be noticeable with the default file-system based index,
but it would be given enough documents, or when the index information is
stored on a network (possibly remote) file system.

This behavior has been observed with the 1.3 final code.

Robert Engels

------=_NextPart_000_004F_01C3D466.573CF030--