lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: performance on filtering against thousands of different publications
Date Tue, 14 Aug 2007 09:04:23 GMT
>> Do you mean it will count the number of documents for each publication source?

Lucene does that for all terms. The Luke plugin simply offers a visualisation of the variance
in term frequencies for a field. It looks something like this: http://www.ucl.ac.uk/~ucbplrd/zipf.png


>> each set can be quite large (hundreds to thousands of publications)

Ah. Filtering based on thousands of terms is likely to slow things down, even if they are
unpopular terms. I'd assumed the performance problem was due to a small number of popular
terms.

Some options:
1) Try to minimise leaping around the disk - sorting your selected terms may help. Look
at the methods in TermEnum and TermDocs, which you can use to build your own bitset from your
(sorted) list of terms.
2) Can you add higher-level terms to your index? Are the publication sources grouped into
stable "sets"? If so, why not index the content with a "publicationSet" field too and use
that in filters, instead of querying using the members of the set (the individual publication sources).

3) The <CachedFilter> tag in contrib's XMLQueryParser already implements an LRU policy
for caching nested filters or queries.

Cheers
Mark
----- Original Message ----
From: Cedric Ho <cedric.ho@gmail.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 14 August, 2007 3:39:10 AM
Subject: Re: performance on filtering against thousands of different publications

On 8/13/07, mark harwood <markharw00d@yahoo.co.uk> wrote:
> I would presume that (like a lot of things) there is a power law at play in the popularity
of publication sources (i.e. a small number of popular sources and a lot of unpopular ones).
> The "Zipf" plugin in Luke can be used to illustrate this distribution for the values
in your "publication source" field.

Do you mean it will count the number of documents for each publication source?

>
> Given this disparity, it makes sense to only cache Filters for the most popular publication
sources. Reading a large list of doc ids (the TermDocs) for these popular terms takes a lot
of time so it makes sense to cache them whereas it clearly is not valuable to use exactly
the same amount of memory (i.e. a new BitSet(reader.maxDoc)) to cache an unpopular term whose
TermDocs can be read from disk quickly.
> I would use BooleanFilter to combine the user's choices of publication source terms and
use CachingWrapperFilter around (popular) individual Term Filters added to the BooleanFilter
rather than using CachingWrapperFilter around the BooleanFilter as a whole. This is because
you are much more likely to get cache hits on the popular individual terms than on a user's
particular selection of publication sources and these cached items can be combined together
in the BooleanFilter super fast.
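To illustrate the quoted advice, a rough sketch of the combine step, using plain java.util.BitSet as a stand-in for the cached filter bitsets (a real implementation would wrap popular TermsFilters in CachingWrapperFilter inside a BooleanFilter; class and term names here are hypothetical):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch: OR together per-term bitsets, reusing cached ones for popular terms.
// Cached bitsets are combined directly (cheap); uncached terms would have
// their bitsets read from TermDocs on demand.
class PopularTermCache {
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();

    void put(String term, BitSet bits) {
        cache.put(term, bits);
    }

    // Combine the user's selected source terms into one filter bitset.
    BitSet combine(Iterable<String> selectedTerms, Map<String, BitSet> onDemand) {
        BitSet result = new BitSet();
        for (String term : selectedTerms) {
            BitSet bits = cache.containsKey(term) ? cache.get(term) : onDemand.get(term);
            if (bits != null) {
                result.or(bits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PopularTermCache cache = new PopularTermCache();
        BitSet times = new BitSet();
        times.set(1);
        times.set(5);
        cache.put("times", times); // pretend "times" is a popular source

        BitSet gazette = new BitSet();
        gazette.set(7);
        Map<String, BitSet> onDemand = new HashMap<String, BitSet>();
        onDemand.put("gazette", gazette); // unpopular source, read on demand

        System.out.println(cache.combine(Arrays.asList("times", "gazette"), onDemand));
    }
}
```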

We are also thinking about similar methods, i.e. caching some common
filters. Let me give a little more detail here.

Our clients usually search with only the default publication set.
However, the default set of publications varies a lot between
clients, and each set can be quite large (hundreds to thousands of
publications).

So we are thinking we may want to use a cache of TermsFilters, where
each TermsFilter filters for a set of publications, and maybe use an
LRU policy to manage the cache of filters.
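One plain-JDK way to sketch such an LRU cache of filters (the class name is made up, and java.util.BitSet stands in for a TermsFilter's bitset):

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: an LRU cache of per-publication-set filter bitsets.
// LinkedHashMap with accessOrder=true keeps entries in least-recently-used
// order, and removeEldestEntry evicts once maxEntries is exceeded.
class FilterLruCache extends LinkedHashMap<String, BitSet> {
    private final int maxEntries;

    FilterLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true -> LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        FilterLruCache cache = new FilterLruCache(2);
        cache.put("clientA-default", new BitSet());
        cache.put("clientB-default", new BitSet());
        cache.get("clientA-default"); // touch A, so B is now the eldest
        cache.put("clientC-default", new BitSet()); // evicts clientB's entry
        System.out.println(cache.keySet());
    }
}
```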

This may eventually work, but we are also looking for better alternatives.

Thanks,
Cedric

>
> Hope this makes sense
> Mark
>
> ----- Original Message ----
> From: Cedric Ho <cedric.ho@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 13 August, 2007 5:17:52 AM
> Subject: performance on filtering against thousands of different publications
>
> Hi all,
>
> My problem is as follows:
>
> Our documents each come from a different publication, and we
> currently have > 5000 different publication sources.
>
> Our clients can choose an arbitrary subset of the publications when
> performing a search. It is not uncommon for a search to have to
> match hundreds or thousands of publications.
>
> I currently index the publication information as a field in each
> document and use a TermsFilter when performing the search. However,
> the performance is less than satisfactory: many simple searches take
> more than 2-3 seconds (our goal: < 0.5 seconds).
>
> Using the CachingWrapperFilter is great for search speed. But I've
> done some calculations and figured that it is basically impossible to
> cache all combinations of publications, or even some common
> combinations.
>
>
> Is there any other more effective way to do the filtering?
>
> (I know that the slowness is not purely due to the publication filter;
> we also have some other things that slow down the search. But this
> one definitely contributes quite a lot to the overall search time.)
>
> Regards,
> Cedric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

