lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: Multiword Highlighting
Date Thu, 15 Feb 2007 17:30:30 GMT
I hope you're all following this old thread, because I've just run into
something I don't quite know what to do about with the SpansExtractor code
that I shamelessly stole.

Let's say my text is "a b c d e f g h" and my query is "a AND z". The
implementation I stole for SpansExtractor (mentioned several times in this
thread) returns a span for "a" which doesn't preserve the sense of the
query. The root of the problem is that when it gets down to assembling the
getSpansFromTermQuery, the sense of "AND" is lost and I get span for the "a"
in the query.

The rest of the kinds of spans don't seem to have the same issue. OR should
return the "a" in the example above. Any phrase queries that come through
work fine. In fact, our application requires that we have an implied
proximity, mostly anyway, so I haven't had to deal with this until now.....

One way, it seems to me, to handle this would be to transform the query
above into a span query with a limit of 10,000, where 10,000 is a magic
number that I'm confident is OK in my application because of the
PositionIncrementGaps I set up during indexing.

Is there a more elegant way of doing this? Or am I missing the boat
entirely? Or did I mess up when I stole the code?

Or, and this would be the easiest for me at least, has this work already
been done and all I really need to do is get a different implementation of
SpansExtractor <G>?


On 2/2/07, Mark Miller <> wrote:
> mark harwood wrote:
> > Hi Mark,
> > Have you looked at the returned spans from any other potential problem
> scenarios (other than the 3 word one you suggest) e.g. complex nested
> "SpanOr" or "SpanNot" logic?
> >
> Nothing super intense, but I haved look at some semi complex nesting and
> it all looks great if you use the full span highlighting...highlighting
> the first and last word of the span only works great if your limited to
> word to word proximity searching (like in my parser <G> works great for
> my sentence and paragraph proximity searching, though i had to add the
> option of hiding my index marker tokens from the output)
> Perhaps you know of something that I haven't run into that may not
> highlight correctly ?
> > Can you attach your code to a new Jira entry so I can have a play?
> >
> >
> I certainly will.
> - Mark
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message