lucene-dev mailing list archives

From melix <cedric.champ...@lingway.com>
Subject Re: Span queries, API and difficulties
Date Sat, 22 Sep 2007 14:31:17 GMT

Hi Grant,

I'll try to isolate parts of the project in order to make a patch. It should
not take long, but as I'm really busy, don't expect it too soon ;) BTW, it
would be simpler for me to get some help, because some things seem hard to
understand (the problem I left at work yesterday was a mysterious next()
method on NearSpans that has only 1 submatch and yet returns true, which I
thought was not possible ;)).

As for the extra collect() parameter: a HitCollector (an abstract class, btw;
I think an interface would be great there too) has only a
collect(int doc, float score) method. What I propose is an extra
collect(int doc, float score, DocumentMatchesHolder matches) method (if
matches == null, fall back on the default collect). I thought about an "Object"
because other people could need different data, but sure, this makes more
sense with a strong type.
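
To make the proposal above concrete, here is a minimal sketch of the overload.
It is only an illustration under assumptions: MatchAwareHitCollector is a
simplified stand-in for HitCollector (not the real Lucene class), and
ClauseCountingCollector and the matchedClauses field are made up for the
example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for per-document match data (not a Lucene class).
final class DocumentMatchesHolder {
    final List<String> matchedClauses;

    DocumentMatchesHolder(List<String> matchedClauses) {
        this.matchedClauses = matchedClauses;
    }
}

// Simplified stand-in for HitCollector, extended with the proposed overload.
abstract class MatchAwareHitCollector {
    // Existing contract: called once per matching document.
    public abstract void collect(int doc, float score);

    // Proposed overload. The default ignores the matches and delegates to
    // collect(doc, score), so existing collectors keep working unchanged.
    public void collect(int doc, float score, DocumentMatchesHolder matches) {
        collect(doc, score);
    }
}

// Example collector that uses the matches when they are available.
class ClauseCountingCollector extends MatchAwareHitCollector {
    final List<Integer> docs = new ArrayList<Integer>();
    int clausesSeen = 0;

    @Override
    public void collect(int doc, float score) {
        docs.add(doc);
    }

    @Override
    public void collect(int doc, float score, DocumentMatchesHolder matches) {
        if (matches == null) {
            collect(doc, score); // fall back on the default collect
            return;
        }
        clausesSeen += matches.matchedClauses.size();
        docs.add(doc);
    }
}
```

With something like this, a top-level Scorer could always call the
three-argument collect() and would no longer need an instanceof check on the
collector.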

Cedric


Grant Ingersoll-5 wrote:
> 
> Hi Cedric,
> 
> Thanks for the detailed response.  My suggestion would be to write up  
> a set of patches that demonstrate what you want for the SpanQuery  
> stuff, and the BooleanQuery stuff, preferably as separate patches.   
> The SpanQuery stuff makes the most sense to me and since I am slowly,  
> but surely, working on it, I could try to incorporate it.
> 
> As for the HitCollector, I am not exactly sure what you are trying to  
> get at there.  What Object is going to be passed in?  Is it the Match  
> object?  What would it mean for other implementations that aren't  
> using a Match object?  How would it be incorporated into Lucene for a  
> general case?  Again, a patch here may make it obvious.
> 
> -Grant
> 
> On Sep 22, 2007, at 5:45 AM, melix wrote:
> 
>>
>> Hi all,
>>
>> Sorry for the late response, I've been quite busy (working on my Lucene
>> tweak, and still not finished ;)). Basically, I need to be able to find
>> out, on a per-document basis, what matched in a complex query. For
>> example, in an OR clause, I need to know which of the subclauses have
>> matched and, going deeper in the query tree, for each subclause itself,
>> find out what matched. The goal is to be able to score documents with
>> semantic reasoning.
>>
>> As I want to limit breaking Lucene compatibility, I've decided to try, as
>> much as possible, to subclass Lucene classes. This is where it starts to
>> be difficult. So I've subclassed (most of) the span query classes so that
>> the getSpans() method returns my own span interface:
>>
>> public interface IExtendedSpans extends Spans,IMatcher {
>> }
>>
>> public interface IMatcher {
>>      Match match();
>> }
>>
>> The reason why I have a separate IMatcher interface is that span queries
>> are not the only queries which may "return" matches. We'll see this
>> later. So I implemented my own SpanNearQuery, which inherits from the
>> Lucene SNQ, so that when a span is found, I can return the corresponding
>> match. A match is a collection of submatches, and I've decided to
>> subclass the Match class for each query type (this makes algorithms more
>> readable and easier to write). For a span near query, the match() method
>> will basically return a SpanNearMatch, and so on.
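
A rough sketch of the match hierarchy described above, under assumptions: only
the names Match, IMatcher and SpanNearMatch come from the message, while
TermMatch, MatchWalker and all fields are hypothetical. The link to Lucene's
Spans (IExtendedSpans) is left out so that the example stands alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Assumed base class: a match is a collection of submatches.
class Match {
    private final List<Match> subMatches = new ArrayList<Match>();

    void add(Match sub) {
        subMatches.add(sub);
    }

    List<Match> subMatches() {
        return Collections.unmodifiableList(subMatches);
    }
}

// Hypothetical leaf: one term occurrence at a token position.
class TermMatch extends Match {
    final String term;
    final int position;

    TermMatch(String term, int position) {
        this.term = term;
        this.position = position;
    }
}

// One Match subclass per query type; this one covers a span near query.
class SpanNearMatch extends Match {
    final int start;
    final int end;

    SpanNearMatch(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Same shape as the IMatcher interface quoted above.
interface IMatcher {
    Match match();
}

final class MatchWalker {
    private MatchWalker() {}

    // Recursively counts the TermMatch leaves in the tree rooted at m.
    static int countTerms(Match m) {
        int count = (m instanceof TermMatch) ? 1 : 0;
        for (Match sub : m.subMatches()) {
            count += countTerms(sub);
        }
        return count;
    }
}
```

Scoring code can then walk the tree returned by match() and branch on the
concrete Match subclass at each level.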
>>
>> Problem: the members of the Lucene span queries are private, not
>> protected, so subclasses cannot use them. For example, my subclass needs
>> access to the clauses, and I have to use the getter while I could
>> directly use the member (performance implication). Next, the spans
>> subclasses are private static classes, and I have to rewrite them to
>> return *my* spans. This is really annoying, because I have to copy the
>> exact inner classes (if not anonymous...) just to add my match() method,
>> and by doing this I'm potentially breaking compatibility with future
>> releases of Lucene.
>>
>> The problem was even harder when I had to add the match() method to
>> BooleanQuery: this class is so complex, and uses so many protected or
>> inner classes (for optimization purposes, I suppose) that I would have to
>> copy a lot of the original source code just to add my method. As
>> documentation on how it works is really hard to find, I decided it would
>> be simpler to write my own boolean queries (which is what I've done now).
>> I know they must perform much worse, but this makes the task much easier.
>>
>> By the way, it would be really nice if you could extract an interface
>> from the Query class. As all my queries implement an interface (to be
>> sure that you don't mix queries which support the match feature with ones
>> that don't), it would avoid many casts (the other solution would be that
>> I extract the interface myself and make my IMatchAwareQuery interface
>> declare those methods, but I'm sure it would be cleaner if this was
>> directly in Lucene).
>>
>> Last but not least, it would be nice if the HitCollector class had a
>> collect() method with an Object parameter: the scoring I'm using cannot
>> just work on a collection of floats. It requires the matches, so I'm
>> passing a DocMatchesHolder instance to my HitCollector so that it can
>> work on it. This leads to the following (not really clean) code, copied
>> into each of my top-level Scorer implementations:
>>
>> public void score(HitCollector aHitCollector) throws IOException {
>>     if (aHitCollector instanceof SearchingContext) {
>>         // Match-aware collector: hand over the matches along with the score.
>>         SearchingContext ctx = (SearchingContext) aHitCollector;
>>         while (next()) {
>>             final DocMatchesHolder doc = docMatches();
>>             final float score = score();
>>             ctx.addHit(doc, score);
>>             ctx.collect(doc(), score);
>>         }
>>     } else {
>>         // Plain HitCollector: fall back to the default scoring loop.
>>         super.score(aHitCollector);
>>     }
>> }
>>
>> Thanks for reading ;)
>>
>> Cedric
>> -- 
>> View this message in context: http://www.nabble.com/Span-queries%2C-API-and-difficulties-tf4500460.html#a12835063
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/

-- 
View this message in context: http://www.nabble.com/Span-queries%2C-API-and-difficulties-tf4500460.html#a12836259
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



