From: "Uwe Schindler"
To: java-user@lucene.apache.org
Subject: RE: lucene algorithm ?
Date: Thu, 26 Apr 2012 09:49:51 +0200

Hi,

> I read the paper by Doug "Space optimizations for total ranking",
>
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)

The algorithms described in that paper are still used to merge the postings
lists of several terms. The "total ranking" described there is implemented in
several key parts of Lucene:

#1 Section 4.1 (parallel merge) is done in BooleanScorer2.java
#2 Section 4.2 (block processing) is done in BooleanScorer.java
#3 Maintaining the queue of top-N results is implemented in
   TopScoreDocCollector.java; for each hit found, the above scorer
   implementations call the collector's collect(docid, score) method.

> particularly the total ranking algorithm described there needs to traverse down
> the entire postings list for all the query terms, so in case of very common query
> terms like "yellow dog", either of the 2 terms may have a very very long
> postings list in case of web search, are they all really traversed in current
> lucene/Solr ? or any heuristics to truncate the list are actually employed?

There are ways to truncate those lists, but none of them is implemented in
Lucene core. The main problem with Lucene's segmented index structure is that
you cannot exit early, because the very last document in a postings list could
be one with a very high score. Techniques like index pruning or index sorting
can optimize this, but they need index preprocessing and you lose updatability
of the index while it is in use.
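To make the parallel merge in #1 a bit more concrete, here is a stand-alone
sketch of a two-term conjunction as a document-at-a-time merge of two
doc-ID-sorted postings lists. It is only an illustration of the idea, not the
actual BooleanScorer2 code (which, roughly speaking, works on Scorer /
DocIdSetIterator objects and also computes scores):

class ConjunctionSketch {
    /**
     * Illustrative document-at-a-time intersection of two postings lists
     * (arrays of ascending doc IDs). For each matching doc the real scorer
     * would compute a score and hand the hit to the collector.
     */
    static void intersect(int[] postingsA, int[] postingsB) {
        int i = 0, j = 0;
        while (i < postingsA.length && j < postingsB.length) {
            int a = postingsA[i], b = postingsB[j];
            if (a == b) {
                // here the scorer would call collector.collect(a, score)
                System.out.println("match: doc " + a);
                i++;
                j++;
            } else if (a < b) {
                i++;   // advance the list that is behind
            } else {
                j++;
            }
        }
    }
}

This never looks at any posting twice, but it also shows why the whole list is
touched: every entry of both lists is visited once before you know all hits.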
The top-N result collection (item #3) is optimized in Lucene to throw away
every collected document whose score can never make it into the top-N priority
queue (once the queue is full, anything scoring below the smallest queued score
is discarded). The score still has to be calculated, though.

> in the case of returning top-k results, I can understand that partitioning the
> postings list into multiple machines, and then combining the top-k from each
> would work, but if we are required to return "the 100th result page", i.e. results
> ranked from 990--1000th, then each partition would still have to find out the
> top 1000, so partitioning would not help much.

Yes, and because of that almost no search engine supports deep paging. Lucene
uses lots of memory when trying to do this, although there are newer collector
implementations in Lucene that allow "optimized" deep paging (again item #3),
provided the top-N of the 99th result page has already been found. In that
case, while collecting page 100 you can throw away all documents that score
higher than the last hit of page 99 (see the searchAfter sketch at the end of
this mail).

> overall, is there any up-to-date detailed docs on the internal algorithms of
> lucene?

Reading the code is the best documentation. But the Javadocs also help; we
recently updated the documentation of Lucene's internals for the upcoming
version 4: https://builds.apache.org/job/Lucene-trunk/javadoc/

> Thanks a lot
> Yang

Uwe
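P.S.: A rough sketch of the deep-paging optimization mentioned above, using the
searchAfter call that is available in recent Lucene versions (3.5 and later, if
I remember correctly). The class and method names here are made up for
illustration; this shows the idea, not the TopScoreDocCollector internals:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class DeepPagingSketch {
    /**
     * Fetch one result page without keeping a huge priority queue.
     * 'after' is the last ScoreDoc of the previously collected page,
     * or null for the first page.
     */
    static TopDocs nextPage(IndexSearcher searcher, Query query,
                            ScoreDoc after, int pageSize) throws IOException {
        if (after == null) {
            return searcher.search(query, pageSize);   // first page
        }
        // searchAfter only collects hits that rank after 'after', so the
        // collector's queue holds pageSize entries instead of page * pageSize.
        return searcher.searchAfter(after, query, pageSize);
    }
}

You keep the last ScoreDoc of each page around and pass it in for the next
page; the posting lists are still traversed completely, but the collection
phase no longer has to remember everything before the requested page.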