lucene-dev mailing list archives

From "Claudio Corsi" <claudio.co...@gmail.com>
Subject Re: Fwd: SpanNearQuery: how to get the "intra-span" matching positions?
Date Mon, 09 Jun 2008 08:36:56 GMT
Hi!
I've implemented an inefficient (but working) solution to the "intra-span"
matching problem. With these modifications (see the attachment) I have a way
to pick all the matching positions *inside* the current NearSpan using the
new method matchingSpans() (to be called after each next()).

There are three files:

1) NearSpans.java (just an interface declaring the matchingSpans method, I
need it for my framework, but it is not mandatory)
2) NearSpansUnordered.java
3) NearSpansOrdered.java

(the package declaration is specific to my project, please ignore it!! ;)

Files 2) and 3) are copies of the ones I found in Lucene 2.3.1.
NearSpansUnordered just provides the implementation of matchingSpans()
without any other modifications to the code: it just cycles over the list of
SpanCell kept inside the current instance and filters out the elements whose
doc() is not equal to the current doc() of the Spans, or whose
start()/end() matching positions are not "compatible" with the ones of the
current Spans state.
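
To give a rough idea of that filtering step, here is a minimal sketch. It is
not the attached code: SpanPosition, FixedSpan and IntraSpanFilter are
hypothetical names that only mimic the doc()/start()/end() accessors of
Lucene's Spans, and "compatible" is assumed to mean "lies within the
boundaries of the current match".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the doc()/start()/end() part of Lucene's Spans.
interface SpanPosition {
    int doc();
    int start();
    int end();
}

// Immutable position triple, used only for this illustration.
final class FixedSpan implements SpanPosition {
    private final int doc, start, end;
    FixedSpan(int doc, int start, int end) {
        this.doc = doc;
        this.start = start;
        this.end = end;
    }
    public int doc() { return doc; }
    public int start() { return start; }
    public int end() { return end; }
}

final class IntraSpanFilter {
    // Keep only the cells that are in the matching document and whose
    // positions lie within the boundaries of the current match (assumed
    // meaning of "compatible").
    static List<SpanPosition> matchingSpans(List<? extends SpanPosition> cells,
                                            int matchDoc, int matchStart, int matchEnd) {
        List<SpanPosition> result = new ArrayList<SpanPosition>();
        for (SpanPosition cell : cells) {
            if (cell.doc() != matchDoc) continue;           // different document
            if (cell.start() >= matchStart && cell.end() <= matchEnd) {
                result.add(cell);                           // inside the current span
            }
        }
        return result;
    }
}
```

The sketch only works if the cells still hold the positions they had at match
time, which is the assumption the unordered case relies on here.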

The case of NearSpansOrdered is a little bit more complicated. I had to
keep track of the subSpans states in the method
shrinkToAfterShortestMatch(). So, I've introduced the subSpansCopy ArrayList
and the inner class SpansCopy that just copies the doc()/start()/end()
values of the passed span. Then I've used this list in the implementation of
matchingSpans() in a way similar to NearSpansUnordered.
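
The copying idea can be sketched as follows. Again this is only an
illustration under assumptions, not the attached code: CursorSpan is a
hypothetical, simplified sub-span that changes state on next(), the fields
of SpansCopy are my guess at a reasonable layout, and SnapshotDemo is just a
holder for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sub-span: walks over {start, end} pairs of one
// document and mutates its own state on every next().
final class CursorSpan {
    private final int doc;
    private final int[][] positions; // each entry is {start, end}
    private int index = -1;
    CursorSpan(int doc, int[][] positions) {
        this.doc = doc;
        this.positions = positions;
    }
    boolean next() { return ++index < positions.length; }
    int doc() { return doc; }
    int start() { return positions[index][0]; }
    int end() { return positions[index][1]; }
}

// Snapshot of doc()/start()/end() taken while the span is at its matching
// position (field layout is an assumption).
final class SpansCopy {
    final int doc, start, end;
    SpansCopy(CursorSpan span) {
        this.doc = span.doc();
        this.start = span.start();
        this.end = span.end();
    }
}

final class SnapshotDemo {
    // Copy the state of every sub-span at the moment a match is found.
    static List<SpansCopy> snapshot(CursorSpan[] subSpans) {
        List<SpansCopy> copies = new ArrayList<SpansCopy>(subSpans.length);
        for (CursorSpan s : subSpans) {
            copies.add(new SpansCopy(s));
        }
        return copies;
    }

    public static void main(String[] args) {
        CursorSpan first = new CursorSpan(1, new int[][] {{2, 3}, {10, 11}});
        CursorSpan second = new CursorSpan(1, new int[][] {{4, 5}, {12, 13}});
        first.next();
        second.next();
        List<SpansCopy> copies = snapshot(new CursorSpan[] {first, second});
        first.next(); // the live sub-span moves past its matching position
        System.out.println("live start=" + first.start()
                + ", copied start=" + copies.get(0).start);
    }
}
```

Once the live sub-span has advanced, only the copy still holds the position
that took part in the match, which is why matchingSpans() reads from the
copies rather than from subSpans directly.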

These copies and the matchingSpans() implementations are not very efficient.
I think that this problem can be solved in a better way. But for my
collection and my application it works fine and fast.

Hope that these files will help someone else!

Cheers.


On Fri, Jun 6, 2008 at 6:34 PM, Paul Elschot <paul.elschot@xs4all.nl> wrote:

> See below.
>
> Op Friday 06 June 2008 16:23:15 schreef Claudio Corsi:
> > Hi,
> > I'm trying to extend the NearSpansOrdered and NearSpansUnordered
> > classes of the Lucene core in order to create a way to access the
> > inner positions of the current span (in a next() loop). Suppose that
> > the current near span starts at position N and ends at position N+k,
> > I would like to discover the starting/ending positions of all the inner
> > clauses that generate such a span.
> >
> > I'm working on the NearSpansOrdered class right now. I guess that
> > this modification could be trivial to do, but it takes me time
> > to understand the code. Any hints about that?
> >
> > Currently (as a very inefficient way to proceed) I've added this
> > method to call *after each next()*, but it doesn't work as expected:
> >
> > public Spans[] matchingSpans() {
> >
> >       ArrayList<Spans> list = new ArrayList<Spans>();
> >       if (subSpans.length == 0) return null;
> >       for(int pos = 0; pos < subSpans.length; pos++) {
> >           if (subSpans[pos].doc() != matchDoc) continue;
> >           if (subSpans[pos].start() >= matchStart
> >                   && subSpans[pos].end() <= matchEnd) {
> >               list.add(subSpans[pos]);
> >           }
> >       }
> >       return list.toArray(new Spans[0]);
> > }
> >
> > As you can see, I'm just looping over the subSpans array, keeping
> > the ones having doc() == matchDoc and whose span starts/ends inside
> > the current near span (matchStart and matchEnd are the boundaries
> > returned by start() and end() of NearSpansOrdered). This technique
> > doesn't work. Maybe the problem is that the subSpans are not in the
> > right state after the next() call?
>
> Correct. The reason is that a match must be of minimal length,
> and for that at least the matching subspan at the lowest
> position needs to be advanced beyond its matching position.
> This is the same for both the ordered and unordered case.
>
> So, to implement the matchingSpans() method, it will be necessary
> to copy the subspans when they are at the matching position.
> This will probably involve some fruitless copying for incomplete
> matches that never become a real match.
>
> There is also a difference beyond ordered/unordered.
> In the ordered case, no overlaps between the matching subspans
> are allowed, and in the unordered case overlaps are allowed.
>
> Regards,
> Paul Elschot


-- 
Claudio Corsi
