From: Tatu Saloranta
Reply-To: tatu@hypermall.net
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: Re: multiple collections indexing
Date: Wed, 19 Mar 2003 10:27:42 -0700

On Wednesday 19 March 2003 01:44, Morus Walter wrote:
...
> Searches must be possible on any combination of collections.
> A typical search includes ~40 collections.
>
> Now the question is how best to implement this in Lucene.
>
> Currently I see basically three possibilities:
> - create a data field containing the collection name for each document
>   and extend the query with an OR-combined list of queries on that name field.
> - create an index per collection and use a MultiSearcher to search all
>   the interesting indexes.
> - (a third one I just discovered): create a data field containing a
>   marker for each collection:
>     x100000000000000000... for the first collection
>     x010000000000000000... for the second
>     x001000000000000000... for the third
>   and so on.
>   The query would then use a wildcard search on this field, e.g.
>   x?0?00000..., with '?' for each collection that should be searched
>   and '0' for the others.
>   The marker would be very long, though (the number of collections is
>   growing, so we also have to leave room for new ones).

This might still be a feasible thing to do, unless the number of collections
changes very frequently (since you would have to reindex all documents, not
just do an incremental update).

Another possibility would be a new kind of Query: one that works on numeric
field values (hex numbers would probably be easiest to use). In a way it'd be
a specialized/optimized version of WildcardQuery.
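The hex value itself (and the mask of "interesting" collections used below)
is easy to produce; very roughly something like the following sketch, where
the class name and the bit layout are just made up for illustration:

// Sketch only: pack one bit per collection into a fixed-width hex string
// (4 collection bits per character). The same method can also build the
// mask of "interesting" collections for a search.
public class CollectionFlags {

    /** Encode the given 0-based collection numbers into a hex string wide
        enough for maxCollections collections (assumes 0 <= c < maxCollections). */
    public static String encode(int[] collections, int maxCollections) {
        int width = (maxCollections + 3) / 4;       // hex chars needed
        int[] nibbles = new int[width];
        for (int i = 0; i < collections.length; i++) {
            int c = collections[i];
            nibbles[c / 4] |= 1 << (3 - (c % 4));   // collection 0 = leftmost bit
        }
        StringBuffer buf = new StringBuffer(width);
        for (int i = 0; i < width; i++) {
            buf.append(Character.forDigit(nibbles[i], 16));
        }
        return buf.toString();                      // e.g. {0} out of 16 -> "8000"
    }
}

With that layout, a document in collection 3 out of 40 would get the value
"1000000000" (10 hex chars), and a search over collections 0 and 3 would use
the mask "9000000000".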
For example, one could define a required bit pattern that the field value has
to produce after being ORed with a mask (in your case, with one bit per
collection, you would require all "non-interesting" collection bits to be
zero; since each document has at least one bit set, that guarantees that at
least one "interesting" bit is one).

Implementing this would be fairly easy: first find the term range (like
RangeQuery does), iterate over all existing terms in that range, match each
one against the bit pattern, and add the term to the query if it matches.
The actual search then works pretty much like a prefix, wildcard or range
query, since the Terms have already been expanded by that point and the
search code need not care how they were obtained.

This would make the representation more compact (4 bits per character instead
of one), potentially making the index a bit smaller (which usually also means
faster). And of course, if you really want to push the limit, you could use
an even more efficient encoding (although, assuming the index uses UTF-8,
base64 is probably about as efficient as it gets, since ASCII characters take
only one byte whereas other characters take anywhere from 2 to 4 bytes).

Adding such a query would have to be done outside QueryParser (since the
length of the bitfield would be variable), but in your case that probably
shouldn't be a problem?

Anyway, just an idea I thought might be worth sharing,

-+ Tatu +-
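P.S. For concreteness, the term-expansion step described above might look
roughly like the sketch below against the old (1.x) TermEnum/BooleanQuery
API. The field name and the "value OR mask == mask" test just follow the
convention sketched earlier, and for simplicity this scans all terms of the
field instead of seeking to a narrower range the way RangeQuery does; it is
untested, illustration only.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import java.io.IOException;
import java.math.BigInteger;

public class CollectionMaskQuery {

    /** Build an OR-query over all indexed terms of the given field whose
        hex-encoded bits all fall inside hexMask (the flags of the
        collections we are interested in). */
    public static Query build(IndexReader reader, String field, String hexMask)
            throws IOException {
        BigInteger mask = new BigInteger(hexMask, 16);
        BooleanQuery query = new BooleanQuery();
        // terms(Term) positions the enumeration at the first term >= the
        // given one, i.e. at the first term of this field:
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;                           // ran past the field's terms
                }
                BigInteger bits = new BigInteger(t.text(), 16);
                // bits OR mask == mask  <=>  every set bit is an "interesting" one
                if (bits.or(mask).equals(mask)) {
                    query.add(new TermQuery(t), false, false); // optional (OR) clause
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return query;
    }
}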