Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 42681 invoked from network); 30 Oct 2006 20:36:44 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 30 Oct 2006 20:36:44 -0000 Received: (qmail 6515 invoked by uid 500); 30 Oct 2006 20:36:41 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 6431 invoked by uid 500); 30 Oct 2006 20:36:41 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 6368 invoked by uid 99); 30 Oct 2006 20:36:41 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 30 Oct 2006 12:36:41 -0800 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 30 Oct 2006 12:36:26 -0800 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id EB1387142EC for ; Mon, 30 Oct 2006 12:35:18 -0800 (PST) Message-ID: <4345396.1162240518960.JavaMail.root@brutus> Date: Mon, 30 Oct 2006 12:35:18 -0800 (PST) From: "Yonik Seeley (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file In-Reply-To: <15359530.1161163054989.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ http://issues.apache.org/jira/browse/LUCENE-687?page=comments#action_12445707 ] Yonik Seeley commented on LUCENE-687: ------------------------------------- I did some quick synthetic tests of my own.... Java5 -server 100,000 documents, 25 different terms, prob of term in doc = 1/term_number, sloppy phrase on 2 random terms Improvement: -3.1% Change term probability to 1/(term_number**2) Improvement: 10.0% Test only terms 24,25 with term probability=1/term_number (so probabilities were 0.5 and 1.0) Imprivement: -8.4% I think your patch is still worth applying because the likely distribution of terms, but I just wish that the performance hit was smaller in the worst-case scenario. > Performance improvement: Lazy skipping on proximity file > -------------------------------------------------------- > > Key: LUCENE-687 > URL: http://issues.apache.org/jira/browse/LUCENE-687 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Michael Busch > Priority: Minor > Attachments: lazy_prox_skipping.patch > > > Hello, > I'm proposing a patch here that changes org.apache.lucene.index.SegmentTermPositions to avoid unnecessary skips and reads on the proximity stream. Currently a call of next() or seek(), which causes a movement to a document in the freq file also moves the prox pointer to the posting list of that document. But this is only necessary if actual positions have to be retrieved for that particular document. > Consider for example a phrase query with two terms: the freq pointer for term 1 has to move to document x to answer the question if the term occurs in that document. But *only* if term 2 also matches document x, the positions have to be read to figure out if term 1 and term 2 appear next to each other in document x and thus satisfy the query. > A move to the posting list of a document can be quite expensive. It has to be skipped to the last skip point before that document and then the documents between the skip point and the desired document have to be scanned, which means that the VInts of all positions of those documents have to be read and decoded. > An improvement is to move the prox pointer lazily to a document only if nextPosition() is called. This will become even more important in the future when the size of the proximity file increases (e. g. by adding payloads to the posting lists). > My patch implements this lazy skipping. All unit tests pass. > I also attach a new unit test that works as follows: > Using a RamDirectory an index is created and test docs are added. Then the index is optimized to make sure it only has a single segment. This is important, because IndexReader.open() returns an instance of SegmentReader if there is only one segment in the index. The proxStream instance of SegmentReader is package protected, so it is possible to set proxStream to a different object. I am using a class called SeeksCountingStream that extends IndexInput in a way that it is able to count the number of invocations of seek(). > Then the testcase searches the index using a PhraseQuery "term1 term2". It is known how many documents match that query and the testcase can verify that seek() on the proxStream is not called more often than number of search hits. > Example: > Number of docs in the index: 500 > Number of docs that match the query "term1 term2": 5 > Invocations of seek on prox stream (old code): 29 > Invocations of seek on prox stream (patched version): 5 > - Michael -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org