Message-ID: <7413795.141561271954749610.JavaMail.jira@thor>
Date: Thu, 22 Apr 2010 12:45:49 -0400 (EDT)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-2410) Optimize PhraseQuery

Optimize PhraseQuery
--------------------

                 Key: LUCENE-2410
                 URL: https://issues.apache.org/jira/browse/LUCENE-2410
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
            Reporter: Michael McCandless
             Fix For: 3.1

Looking
at the scorers for PhraseQuery, I think there are some speedups we could do:

  * The AND part of the scorer (which advances to the next doc that has all the terms), in PhraseScorer.doNext, should do the same optimization as BooleanQuery's ConjunctionScorer, i.e. sort terms from rarest to most frequent. I don't think it should use the linked list/firstToLast() approach that it uses today.

  * We do way too much work now when .score() is not called, because we go and find all occurrences of the phrase in the doc, whereas we should stop after finding the first occurrence and only go and count the rest if .score() is called.

  * For the exact case, I think we can use two int arrays to find the matches. The first array holds the count of how many times a term in the phrase "matched" a phrase starting at that position. When that count == the number of terms in the phrase, it's a match. The 2nd is a "gen" array (it holds the docID for which that count was last touched), to avoid clearing. I.e. when incrementing the count, if the docID != gen, we first reset the count to 0. I think this'd be faster than the PQ we now use. The downside is that if you have immense docs (position gets very large) we'd need 2 immense arrays.

It'd be great to do LUCENE-1252 (already open) along with this, i.e. factor PhraseScorer into two AND'd sub-scorers. The first one should be ConjunctionScorer, and the 2nd one checks the positions (i.e. either the exact or sloppy scorers). This would mean that if the PhraseQuery is AND'd w/ other clauses (or a filter is applied), we would save CPU by not checking the positions for a doc unless all other AND'd clauses accepted the doc.
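
To make the two-array idea for the exact case concrete, here is a minimal sketch. It is not Lucene code: the class ExactPhraseSketch, the method phraseFreq, the maxPosition bound, and the representation of a doc as one int[] of positions per phrase term are all assumptions made for illustration. It also shows the "stop after the first match unless a count is needed" idea via a hypothetical firstOnly flag.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and input layout are hypothetical, not Lucene's API. */
final class ExactPhraseSketch {
    private final int numTerms;
    // count[p]: how many phrase terms agree with a phrase starting at position p
    private final int[] count;
    // gen[p]: docID for which count[p] was last written; lets us skip clearing between docs
    private final int[] gen;

    ExactPhraseSketch(int numTerms, int maxPosition) {
        this.numTerms = numTerms;
        this.count = new int[maxPosition + 1];
        this.gen = new int[maxPosition + 1];
        Arrays.fill(gen, -1); // assumes docIDs are never negative
    }

    /**
     * positions[i] holds the positions (each <= maxPosition) of the i-th phrase
     * term in the doc; term i is assumed to sit at offset i within the phrase.
     * Returns the number of exact phrase matches, or 1 at the first match when
     * firstOnly is true. Each docID should be passed at most once per instance,
     * since an early return leaves partial counts behind for that doc.
     */
    int phraseFreq(int docID, int[][] positions, boolean firstOnly) {
        int freq = 0;
        for (int i = 0; i < numTerms; i++) {
            for (int pos : positions[i]) {
                int start = pos - i; // where the phrase would have to begin
                if (start < 0) {
                    continue;
                }
                if (gen[start] != docID) { // stale entry from an earlier doc
                    gen[start] = docID;
                    count[start] = 0;
                }
                // Each term can bump a given start at most once, so reaching
                // numTerms means every term was found at its expected offset.
                if (++count[start] == numTerms) {
                    if (firstOnly) {
                        return 1;
                    }
                    freq++;
                }
            }
        }
        return freq;
    }
}
```

For example, with a two-term phrase and per-term positions {0, 3} and {1, 4}, both candidate starts 0 and 3 reach a count of 2, so the sketch reports two matches. The arrays are sized by the largest position, which is the memory downside for immense docs mentioned above.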