Date: Mon, 12 Oct 2009 11:02:02 -0700
Message-ID: <4b124c310910121102k5fe69191r5a3ac0c5c4838bbd@mail.gmail.com>
Subject: Re: faceted search performance
From: Jake Mannix
To: java-user@lucene.apache.org

Hey Chris,

On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <christoph.boosz@googlemail.com> wrote:

> Thanks for your reply.
> Yes, it's likely that many terms occur in few documents.
>
> If I understand you right, I should do the following:
> -Write a HitCollector that simply increments a counter
> -Get the filter for the user query once: new CachingWrapperFilter(new
> QueryWrapperFilter(userQuery));
> -Create a TermQuery for each term
> -Perform the search and read the counter of the HitCollector
>
> I did that, but it didn't get faster. Any ideas why?
>

The killer is the "TermQuery for each term" part: running one search per term is hugely expensive.
You need to invert this process: use your query as is, but while walking in the HitCollector, on each doc which matches your query, increment counters for each of the terms in that document. That means you need an in-memory forward lookup for your documents, like a multivalued FieldCache. If you've got roughly the same number of terms as documents, this cache is likely to be as large as your entire index, which is a pretty hefty RAM cost.

But a good thing to keep in mind is that doing this kind of faceting (massively multivalued on a huge term-set) requires a lot of computation, even if you have all the proper structures living in memory:

For each document you look at (which matches your query), you need to look at all of the terms in that document, and increment a counter for each of those terms. So however much time it would normally take for you to do the driving query, it can take as much as that multiplied by the average number of terms in a document in your index. If your documents are big, this could be a pretty huge latency penalty.

  -jake
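
For illustration, here is a minimal sketch of the inverted approach described in this message, written in plain Java rather than against the Lucene API. The class name FacetCountingCollector, the int[][] forward lookup, and the integer term ids are all assumptions made for the example; in a real index the forward lookup would come from something like a multivalued field cache, and collect() would be called from the HitCollector for each hit of the driving query.

```java
import java.util.Arrays;

/**
 * Sketch of one-pass facet counting: for every document that matches the
 * driving query, bump a counter for each term in that document.
 * Plain Java only; every name here is hypothetical, not a Lucene class.
 */
final class FacetCountingCollector {
    // Forward lookup: docTerms[doc] lists the distinct term ids in that doc.
    // Stands in for the in-memory multivalued cache mentioned in the message.
    private final int[][] docTerms;

    // counts[termId] = number of collected docs that contain the term.
    private final int[] counts;

    FacetCountingCollector(int[][] docTerms, int numTerms) {
        this.docTerms = docTerms;
        this.counts = new int[numTerms];
    }

    /** Call once per document matching the driving query. */
    void collect(int doc) {
        // Cost per hit is the number of terms in the doc, so total work is
        // roughly (number of hits) x (average terms per document).
        for (int termId : docTerms[doc]) {
            counts[termId]++;
        }
    }

    /** Returns a copy of the per-term counts gathered so far. */
    int[] counts() {
        return Arrays.copyOf(counts, counts.length);
    }
}
```

The point of the design is that the query runs once and the per-term work happens inside the collection loop, instead of issuing one search per term. The price is the memory for the forward lookup and the extra per-hit loop over each document's terms.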