From: Christoph Boosz
To: java-user@lucene.apache.org
Date: Mon, 12 Oct 2009 23:29:07 +0200
Subject: Re: faceted search performance

Hi Paul,

Thanks for your suggestion. I will test it within the next few days.
However, due to memory limitations, it will only work if the number of
hits is small enough, am I right?

Chris

2009/10/12 Paul Elschot

> Chris,
>
> You could also store term vectors for all docs at indexing
> time, and add the term vectors for the matching docs into a
> (large) map of terms in RAM.
>
> Regards,
> Paul Elschot
>
> On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > Hi Jake,
> >
> > Thanks for your helpful explanation.
> > In fact, my initial solution was to traverse each document in the
> > result once and count the contained terms. As you mentioned, this
> > process took a lot of memory.
> > Trying to confine the memory usage with the facet approach, I was
> > surprised by the decline in performance.
> > Now I know it's nothing abnormal, at least.
> >
> > Chris
> >
> > 2009/10/12 Jake Mannix
> >
> > > Hey Chris,
> > >
> > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > christoph.boosz@googlemail.com> wrote:
> > >
> > > > Thanks for your reply.
> > > > Yes, it's likely that many terms occur in few documents.
> > > >
> > > > If I understand you right, I should do the following:
> > > > - Write a HitCollector that simply increments a counter
> > > > - Get the filter for the user query once: new CachingWrapperFilter(new
> > > >   QueryWrapperFilter(userQuery));
> > > > - Create a TermQuery for each term
> > > > - Perform the search and read the counter of the HitCollector
> > > >
> > > > I did that, but it didn't get faster. Any ideas why?
> > >
> > > The killer is the "TermQuery for each term" part - this is huge. You
> > > need to invert this process: use your query as is, but while walking
> > > in the HitCollector, on each doc which matches your query, increment
> > > counters for each of the terms in that document (which means you need
> > > an in-memory forward lookup for your documents, like a multivalued
> > > FieldCache - and if you've got roughly the same number of terms as
> > > documents, this cache is likely to be as large as your entire index -
> > > a pretty hefty RAM cost).
> > >
> > > But a good thing to keep in mind is that doing this kind of faceting
> > > (massively multivalued on a huge term-set) requires a lot of
> > > computation, even if you have all the proper structures living in
> > > memory:
> > >
> > > For each document you look at (which matches your query), you need to
> > > look at all of the terms in that document, and increment a counter
> > > for that term. So however much time it would normally take for you to
> > > do the driving query, it can take as much as that multiplied by the
> > > average number of terms in a document in your index.
> > > If your documents are big, this could be a pretty huge latency
> > > penalty.
> > >
> > > -jake
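For readers of the archive: below is a minimal, self-contained sketch of the counting scheme Jake describes. It uses a plain HashMap where a real implementation would use Lucene's in-memory forward lookup (a multivalued FieldCache or stored term vectors); the names `forwardIndex` and `countFacets` are illustrative, not Lucene API.

```java
import java.util.*;

public class FacetCounts {

    // One pass over the matching doc ids; no per-term queries.
    // 'forwardIndex' stands in for an in-memory forward lookup
    // (docId -> terms of that doc), like a multivalued FieldCache.
    static Map<String, Integer> countFacets(Map<Integer, List<String>> forwardIndex,
                                            Collection<Integer> matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            for (String term : forwardIndex.getOrDefault(doc, List.of())) {
                counts.merge(term, 1, Integer::sum);  // bump this term's counter
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> fwd = Map.of(
                0, List.of("lucene", "search"),
                1, List.of("lucene", "facet"),
                2, List.of("memory"));
        // Pretend the driving query matched docs 0 and 1 only.
        System.out.println(countFacets(fwd, List.of(0, 1)));
    }
}
```

The work here is (number of matching docs) x (average terms per doc) counter increments, which is exactly the multiplier Jake warns about; the TermQuery-per-term variant that Chris tried instead runs one query against the index for every distinct term, which is why it was no faster.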