From: "Robert Engels"
List-Id: "Lucene Developers List"
Subject: Lucene Optimized Query Broken?
Date: Tue, 6 Jan 2004 15:04:11 -0600

I have implemented an IndexReader that uses a relational datastore, and in performing queries (and reviewing the Lucene code) I see the following behavior: Lucene has no way of limiting further searches based on 'hits' for the more unique terms. UNLESS I AM MISSING SOMETHING...

Say, for example, I have an index whose documents have only 2 fields. The first ('unique') is very unique, in that most documents have at least somewhat varying terms; the second ('boolean') contains only 'true' or 'false'. The index contains 100,000,000+ documents.

If I perform the search "+unique:somevalue +boolean:true", Lucene will search on the first term, returning very few documents, but then it will search on the second term, returning possibly a million+ documents. Only then will it intersect the two lists and return 'hits' of only a few documents.

Shouldn't Lucene look at the term frequency, build the query in order of 'uniqueness', and then have some method of restricting further term searches to only certain sets of documents? The only IndexReader-based support is TermEnum and TermDocs, but neither of these can take a 'document id set restriction'.

THE SAME PROBLEM OCCURS WITH ONLY A SINGLE TERM AS WELL.
Using the same example as above, a search like "+unique:someuniquevalue +unique:someverynonuniquevalue" will still cause Lucene to read all of the index information for 'someverynonuniquevalue', rather than restricting the search to only those documents returned for 'someuniquevalue'.

All types of queries should be reordered to restrict further searches, based on the matches/non-matches in REQUIRED/PROHIBITED term clauses.

This overhead may not be noticeable in the default file-system based index, but it would be given enough documents, or when the index information is stored on a network (possibly remote) file system.

This behavior has been observed with the 1.3 final code.

Robert Engels
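
The reordering proposed above can be sketched as follows. This is only an illustration under stated assumptions, not Lucene code: the Postings class, its skipTo method, and the class name RarestFirstIntersection are hypothetical stand-ins, and each term's postings are modeled as a sorted in-memory array of document ids. In a real index the per-term count would come from something like IndexReader.docFreq, and the skip would have to be supported by the underlying store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RarestFirstIntersection {

    /** Hypothetical stand-in for one term's postings: sorted, distinct doc ids. */
    public static final class Postings {
        public final String term;
        public final int[] docs;

        public Postings(String term, int[] docs) {
            this.term = term;
            this.docs = docs;
        }

        /** Number of documents containing the term. */
        public int docFreq() {
            return docs.length;
        }

        /** Index of the first entry at or after 'from' whose doc id is >= target. */
        public int skipTo(int from, int target) {
            int idx = Arrays.binarySearch(docs, from, docs.length, target);
            return idx >= 0 ? idx : -idx - 1;
        }
    }

    /** Intersects required terms, driving the loop from the rarest term. */
    public static List<Integer> intersect(List<Postings> required) {
        List<Integer> hits = new ArrayList<>();
        if (required.isEmpty()) {
            return hits;
        }
        // Order clauses by document frequency so the rarest term comes first.
        List<Postings> ordered = new ArrayList<>(required);
        ordered.sort(Comparator.comparingInt(Postings::docFreq));

        Postings rarest = ordered.get(0);
        int[] cursor = new int[ordered.size()]; // one read position per term

        // Only documents matched by the rarest term are ever candidates.
        for (int candidate : rarest.docs) {
            boolean inAll = true;
            for (int i = 1; i < ordered.size() && inAll; i++) {
                Postings p = ordered.get(i);
                // Jump forward instead of reading every posting of a common term.
                cursor[i] = p.skipTo(cursor[i], candidate);
                inAll = cursor[i] < p.docs.length && p.docs[cursor[i]] == candidate;
            }
            if (inAll) {
                hits.add(candidate);
            }
        }
        return hits;
    }
}
```

With this shape, the number of lookups against the common term is bounded by the number of documents matching the rarest term, which is the restriction the message asks for. Whether the skip is cheap depends entirely on the index store; here it is a binary search over an array.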