Date: Tue, 23 Jun 2009 16:53:09 -0400
Message-ID: <843920a30906231353n6021c4bcl4fa7d14512c5f1b2@mail.gmail.com>
Subject: Analyzing performance and memory consumption for boolean queries
From: Nigel
To: java-user@lucene.apache.org

Our query performance is surprisingly inconsistent, and I'm trying to figure out why. I've realized that I need to better understand what's going on internally in Lucene when we're searching. I'd be grateful for any answers (including pointers to existing docs, if any).

Our situation is this: We have roughly 250 million docs spread across four indexes. Each doc has about a dozen fields, all stored and most indexed. (They're the usual document things like author, date, title, contents, etc.) Queries differ in complexity but always have at least a few terms in boolean combination, up to some larger queries with dozens or even hundreds of terms combined with ANDs, ORs, NOTs, and parentheses. There's no sorting, even by relevance: we just want to know what matches.

Query performance is often sub-second, but not infrequently it can take over 20 seconds (we use the time-limited hit collector, so anything over 20 seconds is stopped). Obviously the more complex queries are slower on average, but a given query can sometimes be much slower or much faster.

My assumption is that we're having memory problems or disk utilization problems or both. Our app has a 5 GB JVM heap on an 8 GB server with no other user processes running, so we shouldn't be paging and should have some room for Linux disk cache.
The server is lightly loaded and concurrent queries are the exception rather than the norm. Two of the four indexes are updated a few times a day via rsync and subsequently closed and re-opened, but poor query performance doesn't seem to be correlated with these times.

So, getting to some specific questions:

1) How is the inverted index for a given field structured in terms of what's in memory and what's on disk? Is it dynamic, based on available memory, or tunable, or fixed? Is there a rule of thumb that could be used to estimate how much memory is required per indexed field, based on the number of terms and documents? Likewise, is there a rule of thumb to estimate how many disk accesses are required to retrieve the hits for that field? (I'm thinking, by perhaps false analogy, of how a database maintains a B-tree structure that may reside partially in RAM cache and partially in disk pages.)

2) When boolean queries are searched, is it as simple as iterating the hits for each ANDed or ORed term and applying the appropriate logical operators to the results? For example, is searching for "foo AND bar" pretty much the same resource-wise as doing two separate searches, and therefore should the query performance be a linear function of the number of search terms? Or is there some other caching and/or decision logic (perhaps kind of like a database's query optimizer) at work here that makes the I/O and RAM requirements more difficult to model from the query? (Remember that we're not doing any sorting.)

I'm hoping that with some of this knowledge, I'll be able to better model the RAM and I/O usage of the indexes and queries, and thus eventually understand why things are slow or fast.

Thanks,
Chris
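
P.S. To make the model in question 2 concrete, here is a minimal sketch of what "iterating the hits for each term and applying the logical operators" would look like if each term's hits were simply a sorted, duplicate-free array of doc IDs. This is an assumption for illustration only and is not Lucene code: the class NaiveBooleanModel and its and/or methods are hypothetical names. Under this model, the work for "foo AND bar" is proportional to the combined length of the two hit lists.

```java
import java.util.Arrays;

/** Hypothetical illustration of the naive model in question 2; not Lucene code. */
final class NaiveBooleanModel {

    /** AND: two-pointer intersection of two sorted, duplicate-free doc-ID lists. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // doc only in a: not a match
            } else if (a[i] > b[j]) {
                j++;                 // doc only in b: not a match
            } else {
                out[n++] = a[i];     // doc in both lists: a match
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** OR: merge of two sorted, duplicate-free doc-ID lists, emitting each doc once. */
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) {
                out[n++] = a[i++];   // next doc comes from a only
            } else if (i == a.length || b[j] < a[i]) {
                out[n++] = b[j++];   // next doc comes from b only
            } else {
                out[n++] = a[i];     // same doc in both lists
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] foo = {1, 3, 5, 9};    // made-up doc IDs matching "foo"
        int[] bar = {3, 4, 5, 10};   // made-up doc IDs matching "bar"
        System.out.println(Arrays.toString(and(foo, bar))); // prints [3, 5]
        System.out.println(Arrays.toString(or(foo, bar)));  // prints [1, 3, 4, 5, 9, 10]
    }
}
```

If searching works roughly like this, I would expect cost to grow about linearly with the total number of hits touched across all terms. Whether Lucene actually behaves this way, or does something smarter, is what I'm asking.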