From: Tatu Saloranta
Reply-To: tatu@hypermall.net
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: multiple collections indexing
Date: Wed, 19 Mar 2003 10:27:42 -0700

On Wednesday 19 March 2003 01:44, Morus Walter wrote:
...
> Searches must be possible on any combination of collections.
> A typical search includes ~40 collections.
>
> Now the question is how best to implement this in Lucene.
>
> Currently I see basically three possibilities:
> - create a data field containing the collection name for each document
>   and extend the query with an OR-combined list of queries on that name field.
> - create an index per collection and use a MultiSearcher to search all
>   the interesting indexes.
> - (a third one I just discovered): create a data field containing a
>   marker for each collection:
>     x100000000000000000... for the first collection
>     x010000000000000000... for the second
>     x001000000000000000... for the third
>   and so on.
>   The query would then use a wildcard search on this field, e.g.
>   x?0?00000..., with '?' for each collection that should be searched
>   and '0' for the others.
>   The marker would be very long, though (the number of collections is
>   growing, so we also have to leave room for new ones).

This might still be a feasible thing to do, unless the number of collections
changes very frequently (since you would have to reindex all documents, not
just do an incremental update).

Another possibility would be a new kind of Query: one that works on numeric
field values (hex numbers would probably be easiest to use). In a way it'd be
a specialized/optimized version of WildcardQuery.
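The hex value itself (and the mask of "interesting" collections used below)
is easy to produce; very roughly something like the following sketch, where
the class name and the bit layout are just made up for illustration:

// Sketch only: pack one bit per collection into a fixed-width hex string
// (4 collection bits per character). The same method can also build the
// mask of "interesting" collections for a search.
public class CollectionFlags {

    /** Encode the given 0-based collection numbers into a hex string wide
        enough for maxCollections collections (assumes 0 <= c < maxCollections). */
    public static String encode(int[] collections, int maxCollections) {
        int width = (maxCollections + 3) / 4;       // hex chars needed
        int[] nibbles = new int[width];
        for (int i = 0; i < collections.length; i++) {
            int c = collections[i];
            nibbles[c / 4] |= 1 << (3 - (c % 4));   // collection 0 = leftmost bit
        }
        StringBuffer buf = new StringBuffer(width);
        for (int i = 0; i < width; i++) {
            buf.append(Character.forDigit(nibbles[i], 16));
        }
        return buf.toString();                      // e.g. {0} out of 16 -> "8000"
    }
}

With that layout, a document in collection 3 out of 40 would get the value
"1000000000" (10 hex chars), and a search over collections 0 and 3 would use
the mask "9000000000".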
For example, one could define a required bit pattern that the field value has
to produce after being ORed with a mask (in your case, with one bit per
collection, you would require all "non-interesting" collection bits to be
zero; since each document has at least one bit set, that guarantees that at
least one "interesting" bit is one).

Implementing this would be fairly easy: first find the term range (like
RangeQuery does), iterate over all existing terms in that range, match each
one against the bit pattern, and add the term to the query if it matches.
The actual search then works pretty much like a prefix, wildcard or range
query, since the Terms have already been expanded by that point and the
search code need not care how they were obtained.

This would make the representation more compact (4 bits per character instead
of one), potentially making the index a bit smaller (which usually also means
faster). And of course, if you really want to push the limit, you could use
an even more efficient encoding (although, assuming the index uses UTF-8,
base64 is probably about as efficient as it gets, since ASCII characters take
only one byte whereas other characters take anywhere from 2 to 4 bytes).

Adding such a query would have to be done outside QueryParser (since the
length of the bitfield would be variable), but in your case that probably
shouldn't be a problem?

Anyway, just an idea I thought might be worth sharing,

-+ Tatu +-
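P.S. For concreteness, the term-expansion step described above might look
roughly like the sketch below against the old (1.x) TermEnum/BooleanQuery
API. The field name and the "value OR mask == mask" test just follow the
convention sketched earlier, and for simplicity this scans all terms of the
field instead of seeking to a narrower range the way RangeQuery does; it is
untested, illustration only.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import java.io.IOException;
import java.math.BigInteger;

public class CollectionMaskQuery {

    /** Build an OR-query over all indexed terms of the given field whose
        hex-encoded bits all fall inside hexMask (the flags of the
        collections we are interested in). */
    public static Query build(IndexReader reader, String field, String hexMask)
            throws IOException {
        BigInteger mask = new BigInteger(hexMask, 16);
        BooleanQuery query = new BooleanQuery();
        // terms(Term) positions the enumeration at the first term >= the
        // given one, i.e. at the first term of this field:
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;                           // ran past the field's terms
                }
                BigInteger bits = new BigInteger(t.text(), 16);
                // bits OR mask == mask  <=>  every set bit is an "interesting" one
                if (bits.or(mask).equals(mask)) {
                    query.add(new TermQuery(t), false, false); // optional (OR) clause
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return query;
    }
}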