From: Christoph Boosz
To: java-user@lucene.apache.org
Date: Mon, 12 Oct 2009 23:29:07 +0200
Subject: Re: faceted search performance

Hi Paul,

Thanks for your suggestion. I will test it within the next few days.
However, due to memory limitations, it will only work if the number of
hits is small enough, am I right?

Chris

2009/10/12 Paul Elschot

> Chris,
>
> You could also store term vectors for all docs at indexing
> time, and add the term vectors for the matching docs into a
> (large) map of terms in RAM.
>
> Regards,
> Paul Elschot
>
> On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > Hi Jake,
> >
> > Thanks for your helpful explanation.
> > In fact, my initial solution was to traverse each document in the
> > result once and count the contained terms. As you mentioned, this
> > process took a lot of memory.
> > Trying to confine the memory usage with the facet approach, I was
> > surprised by the decline in performance.
> > Now I know it's nothing abnormal, at least.
> >
> > Chris
> >
> > 2009/10/12 Jake Mannix
> >
> > > Hey Chris,
> > >
> > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > christoph.boosz@googlemail.com> wrote:
> > >
> > > > Thanks for your reply.
> > > > Yes, it's likely that many terms occur in few documents.
> > > >
> > > > If I understand you right, I should do the following:
> > > > - Write a HitCollector that simply increments a counter
> > > > - Get the filter for the user query once: new CachingWrapperFilter(new
> > > >   QueryWrapperFilter(userQuery));
> > > > - Create a TermQuery for each term
> > > > - Perform the search and read the counter of the HitCollector
> > > >
> > > > I did that, but it didn't get faster. Any ideas why?
> > >
> > > The killer is the "TermQuery for each term" part - this is huge. You
> > > need to invert this process: use your query as is, but while walking
> > > in the HitCollector, on each doc which matches your query, increment
> > > counters for each of the terms in that document (which means you need
> > > an in-memory forward lookup for your documents, like a multivalued
> > > FieldCache - and if you've got roughly the same number of terms as
> > > documents, this cache is likely to be as large as your entire index -
> > > a pretty hefty RAM cost).
> > >
> > > But a good thing to keep in mind is that doing this kind of faceting
> > > (massively multivalued on a huge term-set) requires a lot of
> > > computation, even if you have all the proper structures living in
> > > memory:
> > >
> > > For each document you look at (which matches your query), you need to
> > > look at all of the terms in that document, and increment a counter
> > > for that term. So however much time it would normally take for you to
> > > do the driving query, it can take as much as that multiplied by the
> > > average number of terms in a document in your index.
> > > If your documents are big, this could be a pretty huge latency
> > > penalty.
> > >
> > > -jake
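For readers of the archive: below is a minimal, self-contained sketch of the counting scheme Jake describes. It uses a plain HashMap where a real implementation would use Lucene's in-memory forward lookup (a multivalued FieldCache or stored term vectors); the names `forwardIndex` and `countFacets` are illustrative, not Lucene API.

```java
import java.util.*;

public class FacetCounts {

    // One pass over the matching doc ids; no per-term queries.
    // 'forwardIndex' stands in for an in-memory forward lookup
    // (docId -> terms of that doc), like a multivalued FieldCache.
    static Map<String, Integer> countFacets(Map<Integer, List<String>> forwardIndex,
                                            Collection<Integer> matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            for (String term : forwardIndex.getOrDefault(doc, List.of())) {
                counts.merge(term, 1, Integer::sum);  // bump this term's counter
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> fwd = Map.of(
                0, List.of("lucene", "search"),
                1, List.of("lucene", "facet"),
                2, List.of("memory"));
        // Pretend the driving query matched docs 0 and 1 only.
        System.out.println(countFacets(fwd, List.of(0, 1)));
    }
}
```

The work here is (number of matching docs) x (average terms per doc) counter increments, which is exactly the multiplier Jake warns about; the TermQuery-per-term variant that Chris tried instead runs one query against the index for every distinct term, which is why it was no faster.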