From: Felipe Hummel
Date: Thu, 27 Oct 2011 15:32:04 -0400
Subject: Re: performance question - number of documents
To: java-user@lucene.apache.org, sol myr
In-Reply-To: <1319459363.39858.YahooMailNeo@web30501.mail.mud.yahoo.com>
References: <1319375162.6743.YahooMailNeo@web30503.mail.mud.yahoo.com> <1319459363.39858.YahooMailNeo@web30501.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=00032557641efb31f104b04cd470

--00032557641efb31f104b04cd470
Content-Type: text/plain; charset=UTF-8

Hi,

there are two types of query processing in document retrieval:
document-at-a-time and term-at-a-time.

Lucene uses document-at-a-time processing. That means each posting list
(the list of documents a word appears in) is sorted by document ID. This
type of processing is usually better for large datasets because it uses
memory more efficiently. It is also great for AND queries, as it can skip
documents (using skip lists) during processing.

In term-at-a-time processing, on the other hand, the posting lists are
often sorted by some "metric", usually term frequency. So what you said
would apply.

See the first page of this article for more information.

Felipe Hummel

On Mon, Oct 24, 2011 at 8:29 AM, sol myr wrote:

> Hi,
>
> Thanks for this reply.
>
> Could I please just ask - doesn't Lucene keep the data sorted, at least
> partially (heuristically)?
>
> E.g. if the reverse index says "the word DOE appears in documents #1, #7,
> #5".
> Won't Lucene do some smart sorting on this list of documents? Maybe by
> frequency, first listing documents that contain many appearances of DOE?
>
> I know ranking considers more subtle factors such as document length, "idf"
> to prioritize rare words, etc.
> But if there are 8 million documents with the word DOE, and I only asked
> for the top 5, I might take a risk and limit the change to (say) 1000
> documents that contain most appearances of that word, and only between them
> bother to calculate the exact ranking...
>
> That's not criticism, I'm no algorithms expert, I just raise the question
> and try to learn...
> Insights would be appreciated :)
> Thanks again.
>
> ----- Original Message -----
> From: Erick Erickson
> To: java-user@lucene.apache.org; sol myr
> Cc:
> Sent: Sunday, October 23, 2011 7:18 PM
> Subject: Re: performance question - number of documents
>
> "Why would it matter...top 5 matches" Because Lucene has to calculate
> the score of all documents in order to ensure that it returns those 5
> documents.
> What if the very last document scored was the most relevant?
>
> Best
> Erick
>
> On Sun, Oct 23, 2011 at 3:06 PM, sol myr wrote:
> > Hi,
> >
> > We've noticed some Lucene performance phenomenon, and would appreciate an
> > explanation from anyone familiar with Lucene internals
> >
> > (I know Lucene as a user, but haven't looked under its hood).
> >
> > We have a Lucene index of about 30 million records.
> > We ran 2 queries: "AND" and "OR" ("+john +doe" versus "john doe").
> > The AND query had much better performance (AND takes about 500 millis,
> > while OR takes about 2000 millis).
> >
> > We wondered whether this has anything to do with the number of potential
> > matches?
> > Our AND has only about 5000 matches (5000 documents contain *both* "john"
> > and "doe").
> > Our OR has about 8 million matches (8 million documents contain *either*
> > "john" or "doe").
> >
> > Does this explain the performance difference?
> > But why would it matter, as long as we take only the top 5 matches
> > (indexSearcher.search(query, 5))...?
> > Is there any other explanation?
> >
> > Thanks :)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org

--00032557641efb31f104b04cd470--
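
To make the thread's point concrete, here is a minimal sketch of document-at-a-time processing over posting lists sorted by document ID, in plain Java with no Lucene dependency. Everything in it is an assumption made for illustration: the class and method names (PostingListDemo, intersect, advance, union, topK) are hypothetical, a binary search stands in for skip lists, and the scores are supplied by the caller. It is not Lucene's actual implementation; it only shows why an AND query can leap over non-matching documents while an OR query has to visit (and score) every match, even when only the top 5 are requested.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Toy illustration only: hypothetical names, not Lucene's implementation.
// Posting lists are assumed to hold unique doc IDs in ascending order.
class PostingListDemo {

    // AND: walk both lists together; the lagging list leaps forward
    // to the other list's current doc ID instead of stepping one by one.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out[n++] = a[i]; i++; j++; }
            else if (a[i] < b[j]) i = advance(a, i, b[j]);
            else j = advance(b, j, a[i]);
        }
        return Arrays.copyOf(out, n);
    }

    // Stand-in for a skip list: first position at or after `from`
    // whose doc ID is >= target, found by binary search.
    static int advance(int[] list, int from, int target) {
        int pos = Arrays.binarySearch(list, from, list.length, target);
        return pos >= 0 ? pos : -pos - 1;
    }

    // OR: a plain merge; every doc ID from either list is a match,
    // so nothing can be skipped.
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) out[n++] = a[i++];
            else if (i == a.length || b[j] < a[i]) out[n++] = b[j++];
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    // Top-k selection with a min-heap of size k. The heap stays small,
    // but every matching document is still visited once.
    static int[] topK(int[] docs, double[] scores, int k) {
        Comparator<Integer> byScore = Comparator.comparingDouble(idx -> scores[idx]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore); // weakest hit on top
        for (int idx = 0; idx < docs.length; idx++) {
            heap.add(idx);
            if (heap.size() > k) heap.poll(); // evict the current weakest hit
        }
        int[] best = new int[heap.size()];
        for (int n = best.length - 1; n >= 0; n--) best[n] = docs[heap.poll()];
        return best; // doc IDs ordered from highest to lowest score
    }

    public static void main(String[] args) {
        int[] john = {1, 4, 7, 9, 12, 20};
        int[] doe = {2, 7, 8, 12, 15, 30};
        int[] and = intersect(john, doe); // docs containing both terms
        int[] or = union(john, doe);      // docs containing either term
        System.out.println("AND candidates: " + and.length
                + ", OR candidates: " + or.length);
    }
}
```

With the two small lists in main, the AND query produces 2 candidates and the OR query 10. Scaled up to the numbers in the original question (about 5000 matches versus about 8 million), the loop in topK runs roughly 1600 times more often for the OR query, which is consistent with the explanation given earlier in the thread that the top 5 can only be known after every match has been scored.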