Reply-To: java-user@lucene.apache.org
Message-ID: <20060610134045.68375.qmail@web50310.mail.yahoo.com>
Date: Sat, 10 Jun 2006 06:40:45 -0700 (PDT)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Reply-To: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: Different scoring mechanism
To: java-user@lucene.apache.org
In-Reply-To: <Pine.LNX.4.58.0606091201080.21598@hal.rescomp.berkeley.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Chris,

Somebody recently asked me about how Lucene processes queries.  Other than working on required clauses in a BooleanQuery first, and skipping if there are no matching Docs for them, there are no other query optimization strategies/tricks, are there?

Otis

----- Original Message ----
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Sent: Friday, June 9, 2006 3:08:35 PM
Subject: RE: Different scoring mechanism


: For example: a query containing two terms: "fast", "car", having
: document frequencies 300.000 and 20.000 in the index respectively. In a
: worst case scenario this would require 320.000 document scores to be
: calculated. I am not really sure how lucene optimizes its search, but I
: guess it does that by first processing the documents having the highest
: term frequencies (and thus highest combined score) with these query
: terms, and pruning the search if the n hits have been found and it's
: certain that no document can be found which will give a higher score.

Nope.  Lucene scores all "matching" documents in the index in increasing
order of docId -- it can optimize the process using "skipTo" in Scorers
when it knows that it's not possible for a document to "match" the
overall query, so it "skips ahead" to the first doc that can match.

e.g.: if you have a boolean query like "+title:cat +title:dog body:snake" it
knows that unless something matches title:cat and title:dog then there is
no point in checking whether it matches body:snake -- let alone scoring
the doc at all.  So BooleanScorer uses skipTo on the individual Scorers
for title:cat and title:dog to keep skipping ahead until it finds a doc
matching both, then it checks whether it also matches body:snake, and only
*then* does it score things.
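
A rough sketch of that skip-ahead loop, in case it helps. This is not
Lucene's BooleanScorer code: PostingsCursor and SkipToSketch are made-up
names, the postings are plain sorted int arrays, and skipTo here returns
the docId it lands on (the real Scorer.skipTo returns a boolean and you
read the position from doc()). It only illustrates the idea that the two
required clauses leapfrog each other and the optional clause is consulted
only for docs that matched both.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-term Scorer: walks a sorted docId list. */
final class PostingsCursor {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docIds; // must be sorted in increasing order
    private int pos = 0;

    PostingsCursor(int... docIds) {
        this.docIds = docIds;
    }

    /** Moves forward to the first docId >= target and returns it (or NO_MORE_DOCS). */
    int skipTo(int target) {
        while (pos < docIds.length && docIds[pos] < target) {
            pos++;
        }
        return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
    }
}

final class SkipToSketch {
    /**
     * Finds docs matching both required cursors, in increasing docId order.
     * Each hit is reported as "docId:matchedOptional".
     */
    static List<String> search(PostingsCursor req1, PostingsCursor req2, PostingsCursor optional) {
        List<String> hits = new ArrayList<>();
        int d1 = req1.skipTo(0);
        int d2 = req2.skipTo(0);
        while (d1 != PostingsCursor.NO_MORE_DOCS && d2 != PostingsCursor.NO_MORE_DOCS) {
            if (d1 < d2) {
                d1 = req1.skipTo(d2);      // req1 is behind: jump it up to req2's doc
            } else if (d2 < d1) {
                d2 = req2.skipTo(d1);      // req2 is behind: jump it up to req1's doc
            } else {
                // Both required clauses agree; only now look at the optional clause.
                boolean matchedOptional = optional.skipTo(d1) == d1;
                hits.add(d1 + ":" + matchedOptional);
                d1 = req1.skipTo(d1 + 1);
                d2 = req2.skipTo(d2 + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PostingsCursor cat = new PostingsCursor(1, 4, 7, 9, 12);       // title:cat
        PostingsCursor dog = new PostingsCursor(2, 4, 8, 9, 12, 15);   // title:dog
        PostingsCursor snake = new PostingsCursor(3, 9, 20);           // body:snake
        System.out.println(search(cat, dog, snake)); // prints [4:false, 9:true, 12:false]
    }
}
```

Docs 1, 2, 7, 8 and 15 are never checked against body:snake at all, and
doc 3 (which matches only body:snake) is never a hit.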

: If I would change the next function in my own scorer to process all
: document ids, I am afraid I will wreck Lucene's optimization method (as
: I am then not serving the documents in descending term frequency order).

It would certainly eliminate Lucene's ability to skip ahead (although
not in the way you imagined) ... but based on the way you've described how
you want scoring to work, it has to score every doc no matter what --
you've said that even if it doesn't contain the term at all it may get a
score value which needs to be factored in to the overall score.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

