lucene-java-user mailing list archives

From Morus Walter <>
Subject Re: multiple collections indexing
Date Thu, 20 Mar 2003 10:17:40 GMT

Thanks for all your answers. I think I'll collect some of the hints and ideas
here rather than commenting on each of them separately.

Doug Cutting writes:
> Morus Walter wrote:
> > Searches must be possible on any combination of collections.
> > A typical search includes ~ 40 collections.
> > 
> > Now the question is, how to implement this in lucene best.
> > 
> > Currently I see basically three possibilities:
> > - create a data field containing the collection name for each document
> >   and extend the query by an OR-combined list of queries on this name field.
> Are lots of different combinations of collections used frequently? 
> Probably not.  If only a handful of different subsets of collections are 
> frequently searched, then QueryFilter could be very useful.
Well, the data in question consists of German encyclopedias, dictionaries and
glossaries. They are provided for a B2C website ( as well
as for various B2B deals. So on the one hand there are specific combinations
of collections. On the other hand all users are free to choose a 
subcollection they want to search. But this feature is used by at most 10% 
of the queries.

> In this approach you construct a QueryFilter for each combination of 
> collections, passing it the collection name query.  Keep the query 
> filter around and re-use it whenever a query with that combination of 
> collections is made.  This is very fast.  It uses one bit per document 
> per filter.  So if you have a million documents and eight common 
> combinations of collections then this would use one megabyte.
> You could also keep a cache of QueryFilters in a LinkedHashMap (JDK 
> 1.4).  If the size of the cache exceeds a limit, throw away its eldest 
> entry by overriding the removeEldestEntry method.  That way, if any 
> combination of collections is possible, but only a few are probable, you 
> can just cache the common subsets as QueryFilters.  Probably we should 
> provide such a QueryFilterCache class with Lucene...
Ok. I'll have a look at this.
I guess one could combine this with selections based on collection ids.
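
A minimal sketch of the cache Doug describes, assuming a modern JDK with
generics rather than JDK 1.4. The class name CollectionFilterCache, the
comma-joined key, and the collection names in main are hypothetical; the type
parameter F stands in for whatever filter object gets cached (in Lucene that
would be a QueryFilter built from the collection name query), so the sketch
makes no Lucene calls. It also redoes the memory estimate of one bit per
document per filter.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

/**
 * LRU cache keyed by a set of collection names (hypothetical helper).
 * F is a placeholder for the cached filter type.
 */
class CollectionFilterCache<F> {
    private final LinkedHashMap<String, F> map;

    CollectionFilterCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.map = new LinkedHashMap<String, F>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, F> eldest) {
                // Called after each put; drop the eldest entry once over the limit.
                return size() > maxEntries;
            }
        };
    }

    /** Canonical key: sorted, de-duplicated names (assumes names contain no commas). */
    static String key(Collection<String> collectionNames) {
        return String.join(",", new TreeSet<String>(collectionNames));
    }

    synchronized F get(Collection<String> names) {
        return map.get(key(names));
    }

    synchronized void put(Collection<String> names, F filter) {
        map.put(key(names), filter);
    }

    synchronized int size() {
        return map.size();
    }

    /** One bit per document per cached filter, expressed in bytes. */
    static long filterBytes(long documents, int filters) {
        return documents * filters / 8;
    }

    public static void main(String[] args) {
        CollectionFilterCache<String> cache = new CollectionFilterCache<String>(2);
        cache.put(Arrays.asList("lexicon", "glossary"), "filter-A");
        cache.put(Arrays.asList("dictionary"), "filter-B");
        // Same set in a different order hits the same entry and marks it as recently used.
        cache.get(Arrays.asList("glossary", "lexicon"));
        // Exceeds the limit of 2, so the least recently used entry (filter-B) is evicted.
        cache.put(Arrays.asList("encyclopedia"), "filter-C");
        System.out.println(cache.get(Arrays.asList("dictionary"))); // prints null
        System.out.println(filterBytes(1000000L, 8));               // prints 1000000
    }
}
```

Keying on the sorted set of names means that the order in which a user picks
collections does not create duplicate cache entries. Collection ids could be
used in place of names in the same way.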

Thanks to John for his notes on multiple collections and
number of open files.
I guess it wouldn't be that bad, since our data is seldom updated
(a few collections get updates once a week or once a month; most of them
are updated less than once a year). So indexing and optimizing each
collection separately would be possible.
If I understand optimization and your calculation correctly, this means
that the f * log_f(N) factor goes away.

Thanks to Vladimir for his explanations on wildcard queries and
the clarification on the search term limitation.

Thanks to Ype for his comments on multiple collections.
