Message-ID: <3F60B981.5090503@lucene.com>
Date: Thu, 11 Sep 2003 11:05:53 -0700
From: Doug Cutting
To: Lucene Users List
Subject: Re: Lucene features
References: <3DC24860-E0DD-11D7-ADF3-000393A564E6@ehatchersolutions.com> <3F5B4D67.9090402@seznam.cz>
In-Reply-To: <3F5B4D67.9090402@seznam.cz>

Leo Galambos wrote:
> Example: I use this notation: inverted_list_term:{list of W values, "-"
> denotes W=0, for 12 documents in a collection}
> A:{23[16]------27}
> B:{--[38]--------}
> C:{18[2-]45239812}
> If your first query is B, the subset of documents (denoted by brackets -
> namely, the 3rd and 4th doc) is selected, and if your second query is "A
> C", then you cannot use global IDFs, because in the subset, the IDF
> factors are different. Globally, A is a better discriminator, but in the
> subset, C is better. This fact is then reflected by the hit list you
> generate, and I guess, the quality will be also affected by this.
>
> The example shows, that you would rather export the subset to an
> auxiliary index (RAMDirectory?) and then use this structure instead of
> the original index. Obviously, it will solve the issue of speed you
> mentioned.
>
> Unfortunately, I am not sure, if you can export the inverted lists when
> you read them. In egothor, I would use a listener in Rider class, in
> Lucene, I would have to rewrite some classes and it could be a real
> problem. Maybe, there is a solution I do not see...

I have some extensions to Lucene that I've not yet committed which make
it possible to easily define synthetic IndexReaders (not currently
supported). So you could do things that way, once I check these in.

But is this really better than just ANDing the clauses together? It
would take some big experiments to know, but my guess is that it doesn't
make much difference to compute a "local" IDF for such things.

Doug
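
For illustration, the global versus "local" IDF comparison in the quoted example can be sketched as follows. This is a minimal sketch, not Lucene code: POSTINGS, doc_freq, and idf are hypothetical names, the postings strings are transcribed from the A/B/C example, and the smoothed formula log(N / (df + 1)) + 1 is an assumed IDF definition.

```python
import math

# Weights W per document for each term, transcribed from the quoted example.
# "-" means W = 0; documents are numbered 0..11 here, so the bracketed
# 3rd and 4th documents are indices 2 and 3.
POSTINGS = {
    "A": "2316------27",
    "B": "--38--------",
    "C": "182-45239812",
}


def doc_freq(term, docs):
    """Count the documents in `docs` whose weight for `term` is non-zero."""
    return sum(1 for d in docs if POSTINGS[term][d] != "-")


def idf(term, docs):
    """Assumed smoothed IDF, log(N / (df + 1)) + 1, over the given documents."""
    return math.log(len(docs) / (doc_freq(term, docs) + 1)) + 1.0


all_docs = list(range(12))
# Documents matched by the first query, B.
subset = [d for d in all_docs if POSTINGS["B"][d] != "-"]

for term in ("A", "C"):
    print(term, "global:", idf(term, all_docs), "local:", idf(term, subset))
```

Under these assumptions A gets the higher IDF over all 12 documents, while C gets the higher IDF within the two documents matched by B, which is the reversal the quoted message describes. Whether that reversal changes ranking quality in practice is the open question; the sketch only shows the arithmetic.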