lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Leo Galambos <Le...@seznam.cz>
Subject Re: Lucene features
Date Sun, 07 Sep 2003 15:23:19 GMT
Erik Hatcher wrote:

> On Friday, September 5, 2003, at 07:45  PM, Leo Galambos wrote:
>
>>> And for the second time today.... QueryFilter.  It allows narrowing 
>>> the documents queried to only the documents from a previous Query.
>>
>>
>>
>> I guess, it would not be an ideal solution - the first query does two 
>> things a) it selects a subset from the corpus; b) it assigns a 
>> relevance to each document of this subset. Your solution omits the 
>> second point. It implies, the solution will not return "good hit 
>> lists", because you will not consider the information value of the 
>> first query which was given to you by a user.
>
>
> Yes, you're right.  Getting the scores of a second query based on the 
> scores of the first query is probably not trivial, but probably 
> possible with Lucene.  And that combined with a QueryFilter would do 
> the trick I suspect.  Somehow the scores of the first query could be 
> remembered and used as a boost (or other type of factor) the scores of 
> the second query.


Well, I do not want to be a pessimist, but the boost vector is not a 
good solution due to CWI statistics. On the other hand, it is much 
better than the simple QueryFilter which, in fact, works as 0/1 boost.

Example: I use this notation: inverted_list_term:{list of W values, "-" 
denotes W=0, for 12 documents in a collection}
A:{23[16]------27}
B:{--[38]--------}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by brackets - 
namely, the 3rd and 4th doc) is selected, and if your second query is "A 
C", then you cannot use global IDFs, because in the subset, the IDF 
factors are different. Globally, A is better distriminator, but in the 
subset, C is better. This fact is then reflected by the hit list you 
generate, and I guess, the quality will be also affected by this.

The example shows, that you would rather export the subset to an 
auxiliary index (RAMDirectory?) and then use this structure instead of 
the original index. Obviously, it will solve the issue of speed you 
mentioned.

Unfortunately, I am not sure, if you can export the inverted lists when 
you read them. In egothor, I would use a listener in Rider class, in 
Lucene, I would have to rewrite some classes and it could be a real 
problem. Maybe, there is a solution I do not see...

Your turn ;-)
Cheers,
Leo

>
> Am I off base here?
>
>> Thus I think, Chris would implement something more complex than 
>> QueryFilter. If not, the results will be poorer than with the 
>> commercial packages he may get. He could use a different model where 
>> "AND" is not an associative operator (i.e. some modification of the 
>> extended Boolean model). It implies, he would implement it in 
>> Similarity.java (if I remember that class name correctly).
>
>
> Right... but you'd still need the filtering capability as well, I 
> would think - at least for performance reasons.
>
>     Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>




Mime
View raw message