lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: performance on filtering against thousands of different publications
Date Mon, 13 Aug 2007 13:17:49 GMT
Have you tried the very simple technique of just making an OR clause
containing all the sources for a particular query and just letting
it run? I was surprised at the speed...
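
A minimal sketch of the OR-clause idea, in plain Java so that it does not depend on any particular Lucene version. It only builds the query text; the field name "source" and the publication ids are assumptions, since the original post does not name them, and the ids are assumed to be simple tokens that need no query-parser escaping.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: builds the text of one big
// OR clause over the publications a client has selected.
final class SourceClause {

    // Returns e.g. "source:(nyt OR wsj)" for field "source" and ids [nyt, wsj].
    static String orClause(String field, List<String> sourceIds) {
        if (sourceIds.isEmpty()) {
            throw new IllegalArgumentException("at least one source is required");
        }
        return field + ":(" + String.join(" OR ", sourceIds) + ")";
    }

    // Combines the user's own query with the source restriction so that
    // both parts must match.
    static String restrict(String userQuery, String field, List<String> sourceIds) {
        return "+(" + userQuery + ") +" + orClause(field, sourceIds);
    }
}
```

The resulting string would then go through the normal query parser. One thing to watch with thousands of sources: BooleanQuery has a default limit of 1024 clauses and throws TooManyClauses beyond it, so the limit would probably have to be raised with BooleanQuery.setMaxClauseCount.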

But before doing *any* of that, you need to find out, and tell us, what
exactly is taking the time. Are you opening a new IndexReader for
each query? Are you iterating through a Hits object that has more than
100 (maybe it's 200 now) entries? Are you loading each document that
satisfies the query? Etc. Etc.

Put some simple timers in your code and measure exactly what's taking the
time before tuning your code. Time the call to search. Time the call for
parsing. Time the assembly of the responses, in, say, blocks of 100.
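
The timing advice above could be sketched roughly like this. It is only an illustration in plain Java: StageTimer and the stage names ("parse", "search", "assemble") are made up for the example, and the lambdas passed to time() stand in for the real parsing, searching and response-assembly calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per named stage so that
// parsing, searching and response assembly can be compared directly.
final class StageTimer {

    // Insertion-ordered so the report lists stages in the order first seen.
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    // Runs the work, adds its elapsed time to the stage total (even if the
    // work throws), and returns the work's result.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds recorded for a stage, or 0 if it was never timed.
    long nanos(String stage) {
        return totalNanos.getOrDefault(stage, 0L);
    }

    // One line per stage, in milliseconds.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            sb.append(String.format("%s: %.3f ms%n", e.getKey(), e.getValue() / 1e6));
        }
        return sb.toString();
    }
}
```

Wrap each stage of a request in time(), run a batch of representative queries, then print report() and see which stage dominates. Calls that throw checked exceptions such as IOException would need to be handled inside the lambda.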

You simply cannot improve your code without knowing, through
measurement, what is taking the time. Virtually every time I've tried to
improve speed without measuring first, I've been wrong <G>..

BTW, have you looked over the suggestions here?

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


Best
Erick

On 8/13/07, Cedric Ho <cedric.ho@gmail.com> wrote:
>
> Hi all,
>
> My problem is as follows:
>
> Each of our documents comes from a particular publication, and we
> currently have > 5000 different publication sources.
>
> Our clients can choose an arbitrary subset of the publications when
> performing a search. It is not uncommon for a search to have to
> match hundreds or thousands of publications.
>
> I currently index the publication information as a field in
> each document and use a TermsFilter when performing a search. However,
> the performance is less than satisfactory. Many simple searches take
> more than 2-3 seconds (our goal: < 0.5 seconds).
>
> Using the CachingWrapperFilter is great for search speed. But I've
> done some calculations and figured that it is basically impossible to
> cache all combinations of publications, or even some common
> combinations.
>
>
> Is there any other more effective way to do the filtering?
>
> (I know that the slowness is not purely due to the publication filter,
> we also have some other things that will slow down the search. But
> this one definitely contributed quite a lot to the overall search
> time)
>
> Regards,
> Cedric
>
