From: "Uwe Schindler"
To: java-user@lucene.apache.org
Subject: RE: lucene algorithm ?
Date: Thu, 26 Apr 2012 09:49:51 +0200

Hi,

> I read the paper by Doug "Space optimizations for total ranking",
>
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)

The algorithms described in that paper are still used to merge the postings
lists of several terms. The "total ranking" described there is implemented in
several key parts of Lucene:

#1 Section 4.1 (parallel merge) is done in BooleanScorer2.java
#2 Section 4.2 (block processing) is done in BooleanScorer.java
#3 Maintaining the queue of top-N results is implemented in
   TopScoreDocCollector.java; for each hit found, the above scorer
   implementations call the collector's collect(docid, score) method.

> particularly the total ranking algorithm described there needs to traverse down
> the entire postings list for all the query terms, so in case of very common query
> terms like "yellow dog", either of the 2 terms may have a very very long
> postings list in case of web search, are they all really traversed in current
> lucene/Solr ? or any heuristics to truncate the list are actually employed?

There are ways to truncate those lists, but none of them is implemented in
Lucene core. The main problem with Lucene's segmented index structure is that
you cannot exit early, because the very last document in a postings list could
be one with a very high score. Techniques like index pruning or index sorting
can optimize this, but they need index preprocessing and you lose updatability
of the index while it is in use.
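To make the parallel merge in #1 a bit more concrete, here is a stand-alone
sketch of a two-term conjunction as a document-at-a-time merge of two
doc-ID-sorted postings lists. It is only an illustration of the idea, not the
actual BooleanScorer2 code (which, roughly speaking, works on Scorer /
DocIdSetIterator objects and also computes scores):

class ConjunctionSketch {
    /**
     * Illustrative document-at-a-time intersection of two postings lists
     * (arrays of ascending doc IDs). For each matching doc the real scorer
     * would compute a score and hand the hit to the collector.
     */
    static void intersect(int[] postingsA, int[] postingsB) {
        int i = 0, j = 0;
        while (i < postingsA.length && j < postingsB.length) {
            int a = postingsA[i], b = postingsB[j];
            if (a == b) {
                // here the scorer would call collector.collect(a, score)
                System.out.println("match: doc " + a);
                i++;
                j++;
            } else if (a < b) {
                i++;   // advance the list that is behind
            } else {
                j++;
            }
        }
    }
}

This never looks at any posting twice, but it also shows why the whole list is
touched: every entry of both lists is visited once before you know all hits.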
The top-N result collection (item #3) is optimized in Lucene to throw away
every collected document whose score can never make it into the top-N priority
queue (once the queue is full, anything scoring below the smallest queued score
is discarded). The score still has to be calculated, though.

> in the case of returning top-k results, I can understand that partitioning the
> postings list into multiple machines, and then combining the top-k from each
> would work, but if we are required to return "the 100th result page", i.e. results
> ranked from 990--1000th, then each partition would still have to find out the
> top 1000, so partitioning would not help much.

Yes, and because of that almost no search engine supports deep paging. Lucene
uses lots of memory when trying to do this, although there are newer collector
implementations in Lucene that allow "optimized" deep paging (again item #3),
provided the top-N of the 99th result page has already been found. In that
case, while collecting page 100 you can throw away all documents that score
higher than the last hit of page 99 (see the searchAfter sketch at the end of
this mail).

> overall, is there any up-to-date detailed docs on the internal algorithms of
> lucene?

Reading the code is the best documentation. But the Javadocs also help; we
recently updated the documentation of Lucene's internals for the upcoming
version 4: https://builds.apache.org/job/Lucene-trunk/javadoc/

> Thanks a lot
> Yang

Uwe
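P.S.: A rough sketch of the deep-paging optimization mentioned above, using the
searchAfter call that is available in recent Lucene versions (3.5 and later, if
I remember correctly). The class and method names here are made up for
illustration; this shows the idea, not the TopScoreDocCollector internals:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class DeepPagingSketch {
    /**
     * Fetch one result page without keeping a huge priority queue.
     * 'after' is the last ScoreDoc of the previously collected page,
     * or null for the first page.
     */
    static TopDocs nextPage(IndexSearcher searcher, Query query,
                            ScoreDoc after, int pageSize) throws IOException {
        if (after == null) {
            return searcher.search(query, pageSize);   // first page
        }
        // searchAfter only collects hits that rank after 'after', so the
        // collector's queue holds pageSize entries instead of page * pageSize.
        return searcher.searchAfter(after, query, pageSize);
    }
}

You keep the last ScoreDoc of each page around and pass it in for the next
page; the posting lists are still traversed completely, but the collection
phase no longer has to remember everything before the requested page.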