lucene-dev mailing list archives

From melix <cedric.champ...@lingway.com>
Subject Span queries, API and difficulties
Date Sat, 22 Sep 2007 09:45:12 GMT

Hi all,

Sorry for the late response, I've been quite busy (working on my Lucene
tweak, and still not finished ;)). Basically, I need to find out, for each
document, what matched in a complex query. For example, in an OR clause, I
need to know which of the subclauses matched and, going deeper into the
query tree, what matched within each subclause. The goal is to be able to
score documents using semantic reasoning.

As I want to avoid breaking Lucene compatibility, I've decided to subclass
Lucene classes as much as possible. This is where it starts to be
difficult. I've subclassed (most of) the span query classes so that the
getSpans() method returns my own span interface:

public interface IExtendedSpans extends Spans,IMatcher {
}

public interface IMatcher {
     Match match();
}

The reason why I have a separate IMatcher interface is that span queries are
not the only queries which may "return" matches, as we'll see later. I
implemented my own SpanNearQuery, which extends the Lucene SpanNearQuery, so
that when a span is found, I can return the corresponding match. A match is
a collection of submatches, and I've decided to subclass the Match class for
each query type (this makes the algorithms more readable and easier to
write). For a span near query, the match() method returns a SpanNearMatch,
and so on.
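
To make the idea concrete, here is a minimal sketch of such a match
hierarchy. Only the names IMatcher, Match and SpanNearMatch come from the
description above; TermMatch, the fields and the method bodies are
assumptions made up for illustration, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the bodies below are illustrative assumptions.
interface IMatcher {
    Match match();
}

// A match is a collection of submatches, so matches form a tree
// that mirrors the query tree.
class Match {
    private final List<Match> subMatches = new ArrayList<Match>();

    void add(Match sub) {
        subMatches.add(sub);
    }

    List<Match> subMatches() {
        return Collections.unmodifiableList(subMatches);
    }

    // Counts this node plus all of its descendants.
    int size() {
        int n = 1;
        for (Match m : subMatches) {
            n += m.size();
        }
        return n;
    }
}

// Assumed leaf type: a single term occurrence at a given position.
class TermMatch extends Match {
    final String term;
    final int position;

    TermMatch(String term, int position) {
        this.term = term;
        this.position = position;
    }
}

// Match for a span near query: one submatch per clause, plus span bounds.
class SpanNearMatch extends Match {
    final int start;
    final int end;

    SpanNearMatch(int start, int end) {
        this.start = start;
        this.end = end;
    }
}
```

With one subclass per query type, scoring code can switch on the concrete
match type while walking the tree instead of inspecting the query.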

Problem: the members of the Lucene span queries are private, not protected,
so subclasses cannot use them. For example, my subclass needs access to the
clauses, and I have to go through the getter where I could use the member
directly (performance implication). Next, the Spans implementations are
private static classes, so I have to rewrite them to return *my* spans.
This is really annoying: I have to copy the exact inner classes (if not
anonymous...) just to add my match() method, and by doing this I'm
potentially breaking compatibility with future releases of Lucene.

The problem was even harder when I had to add the match() method to
BooleanQuery: this class is so complex, and uses so many protected or inner
classes (for optimization purposes, I suppose), that I would have to copy a
lot of the original source code just to add my method. As documentation on
how it works is really hard to find, I decided it would be simpler to write
my own boolean queries (which is what I've done now). I know they must
perform much worse, but this makes the task much easier.

By the way, it would be really nice if you could extract an interface
from the Query class. As all my queries implement an interface (to make sure
that queries which support the match feature are not mixed with ones that
don't), it would avoid many casts (the other solution would be to extract
the interface myself and have my IMatchAwareQuery interface declare those
methods, but I'm sure it would be cleaner if this was directly in Lucene).
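
As a rough sketch of what this would allow: IQuery below is a hypothetical
stand-in for the interface extracted from Query (Lucene does not provide
it), and the two query classes are invented for illustration. Only the
IMatchAwareQuery name comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an interface extracted from Lucene's Query class.
interface IQuery {
    String toString(String field);
}

// Marker for queries that support the match feature; it declares
// nothing extra in this sketch.
interface IMatchAwareQuery extends IQuery {
}

// Assumed leaf query, for illustration only.
class MatchAwareTermQuery implements IMatchAwareQuery {
    private final String text;

    MatchAwareTermQuery(String text) {
        this.text = text;
    }

    public String toString(String field) {
        return field + ":" + text;
    }
}

// Composite that only accepts match-aware clauses: adding a plain IQuery
// is rejected at compile time, and no cast is needed when iterating.
class MatchAwareOrQuery implements IMatchAwareQuery {
    private final List<IMatchAwareQuery> clauses = new ArrayList<IMatchAwareQuery>();

    void add(IMatchAwareQuery clause) {
        clauses.add(clause);
    }

    public String toString(String field) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < clauses.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(clauses.get(i).toString(field));
        }
        return sb.append(")").toString();
    }
}
```

Because IMatchAwareQuery extends the extracted interface, a match-aware
query can be used anywhere a query is expected without casting back and
forth.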

Last but not least, it would be nice if the HitCollector class had a
collect() method with an Object parameter: the scoring I'm using cannot
just work on a collection of floats. It requires the matches, so I'm passing
a DocMatchesHolder instance to my HitCollector so that it can work on it.
This leads to the following (not really clean) code, duplicated in my
top-level Scorer implementations:

public void score(HitCollector aHitCollector) throws IOException {
    if (aHitCollector instanceof SearchingContext) {
        // Match-aware collector: hand over the matches along with the score.
        SearchingContext ctx = (SearchingContext) aHitCollector;
        while (next()) {
            final DocMatchesHolder doc = docMatches();
            final float score = score();
            ctx.addHit(doc, score);
            ctx.collect(doc(), score);
        }
    } else {
        // Any other collector: fall back to the inherited behaviour.
        super.score(aHitCollector);
    }
}
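
For comparison, here is a sketch of what the requested extension point
could look like. PayloadHitCollector is hypothetical and is not Lucene's
HitCollector (which only offers collect(int doc, float score)); the fields
of DocMatchesHolder and the body of SearchingContext are likewise invented
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector with an extra Object parameter.
// This is NOT part of Lucene; it only illustrates the request above.
abstract class PayloadHitCollector {
    abstract void collect(int doc, float score, Object payload);
}

// Assumed shape of the holder; the real class is not shown in this message.
class DocMatchesHolder {
    final int doc;
    final int matchCount;

    DocMatchesHolder(int doc, int matchCount) {
        this.doc = doc;
        this.matchCount = matchCount;
    }
}

// Collector that keeps the matches delivered with each hit.
class SearchingContext extends PayloadHitCollector {
    final List<DocMatchesHolder> hits = new ArrayList<DocMatchesHolder>();
    float totalScore;

    void collect(int doc, float score, Object payload) {
        // The payload still needs one cast, but the scorer no longer
        // has to test the collector's type with instanceof.
        if (payload instanceof DocMatchesHolder) {
            hits.add((DocMatchesHolder) payload);
        }
        totalScore += score;
    }
}
```

With such a method, the scorer's loop would reduce to a single
collect(doc(), score(), docMatches()) call for every kind of collector.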

Thanks for reading ;)

Cedric

