lucene-dev mailing list archives

From "Goddard, Michael J." <MICHAEL.J.GODD...@saic.com>
Subject RE: Question on highlighting of nested SpanQuery instances
Date Fri, 26 Feb 2010 15:32:34 GMT
Mark,

After making some changes to a few classes,

M      src/java/org/apache/lucene/search/spans/TermSpans.java
M      contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
M      contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
M      contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTerm.java

the JUnit test below passes.  I'm seeing some issues with other tests which I'll have to take
care of, and I'm not yet sure how I'll deal with Spans instances (as opposed to NearSpansOrdered,
NearSpansUnordered, and TermSpans), since it's an abstract class and I can't call getSubSpans()
on that.  I was thinking I ought to open a Jira issue for this, attach the current patch,
and just keep working.  Does this sound like something other users might find useful?

  Mike


-----Original Message-----
From: java-dev-return-46947-MICHAEL.J.GODDARD=saic.com@lucene.apache.org on behalf of Mark
Miller
Sent: Mon 2/22/2010 3:41 PM
To: java-dev@lucene.apache.org
Subject: Re: Question on highlighting of nested SpanQuery instances
 
I played with it some time back, but I don't have any code left from that exercise.

It's fairly tricky.

Take your example:

>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
>     new SpanTermQuery(new Term(fieldName, "lucene")),
>     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>
>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
>     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

First you see the top level SpanNearQuery -

you want to recurse in and just work with the "lucene within 5 of doug, ordered" part. But you
can't actually work with that alone. That whole span also has to be within 4 of hadoop, ordered
... so how do you constrain the sub-highlighting? Let's say you do it somehow.

Now you recurse in and want to highlight hadoop - but again, not every hadoop - only the hadoops
that are within 4, ordered, of the first Span.

So that's really the issue - you want to break up the Span and highlight recursively - but
you can't really break them up and maintain all of the positional restrictions required.

So another possible option that gets a little messier might be:

when extracting the allowable positions for a term (which it does by checking the start and
end of the span), you might also run each inner span that contains that term, and then intersect
the positions you find that way with the positions found with the overall span and use that
list as the allowable positions. That could get kind of complicated though, especially taking
into account the logic of the SpanOr and SpanNot queries.
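
A minimal sketch of that intersection idea, kept free of Lucene classes so it runs on its own. Everything in it is an assumption made for illustration: the `Range` type, the method names, whitespace tokenization with no stop-word removal, and the hand-computed match ranges for the example text (overall span at positions 1 to 11, inner lucene/doug span at 1 to 6, end exclusive). It is not how WeightedSpanTermExtractor works today.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class SpanIntersectSketch {

    /** Hypothetical stand-in for one span match: half-open range [start, end). */
    static final class Range {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
        boolean contains(int pos) { return pos >= start && pos < end; }
    }

    /** Positions of `term` that fall inside at least one of the given ranges. */
    static SortedSet<Integer> positionsWithin(String[] tokens, String term, List<Range> ranges) {
        SortedSet<Integer> found = new TreeSet<Integer>();
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!tokens[pos].equalsIgnoreCase(term)) {
                continue;
            }
            for (Range r : ranges) {
                if (r.contains(pos)) {
                    found.add(pos);
                    break;
                }
            }
        }
        return found;
    }

    /** Allowable positions = (positions inside the overall span) intersected with
     *  (positions inside the inner spans that contain the term). */
    static SortedSet<Integer> allowedPositions(String[] tokens, String term,
                                               List<Range> outerMatches, List<Range> innerMatches) {
        SortedSet<Integer> allowed = positionsWithin(tokens, term, outerMatches);
        allowed.retainAll(positionsWithin(tokens, term, innerMatches));
        return allowed;
    }

    public static void main(String[] args) {
        String[] tokens =
            "The Lucene was made by Doug Cutting and Lucene great Hadoop was".split(" ");
        List<Range> outer = Arrays.asList(new Range(1, 11)); // assumed overall match
        List<Range> inner = Arrays.asList(new Range(1, 6));  // assumed inner lucene..doug match
        System.out.println(allowedPositions(tokens, "lucene", outer, inner)); // prints [1]
    }
}
```

With these assumed ranges the second "Lucene" at position 8 drops out, because it lies inside the overall span but inside no match of the inner span. A term that is a direct clause of the outer query (hadoop here) has no inner span to intersect with, and the sketch says nothing about SpanOr or SpanNot, which is where the real complication would be.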

- Mark

On 02/22/2010 03:15 PM, Goddard, Michael J. wrote: 

	Mark,
	
	Thanks a lot for the insight.  I'm working with this today, diving into the WeightedSpanTermExtractor
	class and fiddling with it.  If you ever did have any code which attempted to recurse into
	these structures, I'd be happy to get my hands on it.
	
	Thanks again.
	
	  Mike
	
	
	
	-----Original Message-----
	From: Mark Miller [mailto:markrmiller@gmail.com]
	Sent: Mon 2/22/2010 9:15 AM
	To: java-dev@lucene.apache.org
	Cc: Goddard, Michael J.
	Subject: Re: Question on highlighting of nested SpanQuery instances
	
	Hey Michael - this is currently just a limitation of the Span
	highlighter. It does a bit of fudging when determining what a good
	position is - if a term from the text is found within the span of a
	spanquery it is in (no matter how deeply nested), the highlighter makes
	a guess that the term should be highlighted - this is because we don't
	have the actual positions of each term - just the positions of the start
	and end of the span. In almost all cases this works as you would expect
	- but when nesting spans like this, you can get spurious results within
	the overall span.
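
A minimal sketch of the fudging described above, written without Lucene classes so it stands alone. It assumes whitespace tokenization with no stop-word removal, and a hand-computed overall span for the test text running from the first "Lucene" (position 1) through "Hadoop" (position 10), with an exclusive end of 11. The class and method names are hypothetical, and a real analyzer may assign different positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SpanWindowSketch {

    /** Positions that get highlighted when the only check is
     *  "is this a query term located inside [spanStart, spanEnd)?". */
    static List<Integer> highlightedPositions(String[] tokens, Set<String> queryTerms,
                                              int spanStart, int spanEnd) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int pos = spanStart; pos < spanEnd && pos < tokens.length; pos++) {
            if (queryTerms.contains(tokens[pos].toLowerCase())) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens =
            "The Lucene was made by Doug Cutting and Lucene great Hadoop was".split(" ");
        Set<String> terms = new HashSet<String>(Arrays.asList("lucene", "doug", "hadoop"));
        // Only the start and end of the overall span are known, not which
        // term occurrences actually took part in the match.
        System.out.println(highlightedPositions(tokens, terms, 1, 11)); // prints [1, 5, 8, 10]
    }
}
```

Position 8 is the second "Lucene": it is a query term and it sits inside the overall span, so the start/end check alone cannot tell it apart from the occurrence at position 1.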
	
	So your idea that we should recurse into the Span is on the right track
	- but it just gets fairly complicated quickly. Consider
	SpanNear(SpanNear(mark, miller, 3), SpanTerm(lucene), 4) - if we recurse
	in and grab the first SpanNear (mark, miller, 3), we can correctly
	highlight that - but then we will handle lucene by itself - so all
	lucene terms will be hit rather than the one within 4 of the first span.
	So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then
	also handle when they are linked, either with each other or with a
	SpanTerm - and uh - it gets hard real fast. Hence the fuzziness that
	goes on now.
	
	There may be something we can do to improve things in the future, but
	it's kind of an accepted limitation at the moment - probably something we
	should add some doc about.
	
	- Mark
	
	Goddard, Michael J. wrote:
	>
	> Hello,
	>
	> I initially posted a version of this question to java-user, but think
	> it's more of a java-dev question.  I haven't yet been able to resolve
	> why I'm seeing spurious highlighting in nested SpanQuery instances.
	> To illustrate this, I added the code below to the HighlighterTest
	> class in lucene_2_9_1:
	>
	> /*
	>  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
	>  */
	> public void testHighlightingNestedSpans2() throws Exception {
	>
	>   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
	>   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
	>
	>   String fieldName = "SOME_FIELD_NAME";
	>
	>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
	>     new SpanTermQuery(new Term(fieldName, "lucene")),
	>     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
	>
	>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
	>     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
	>
	>   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
	>   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
	>
	>   String observed = highlightField(query, fieldName, theText);
	>   System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);
	>
	>   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
	> }
	>
	> Is this an issue that's arisen before?  I've been reading through the
	> source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor,
	> Spans, and NearSpansOrdered, but haven't found the solution yet.
	> Initially, I thought that the extractWeightedSpanTerms method in
	> WeightedSpanTermExtractor should be called on each clause of a
	> SpanNearQuery or SpanOrQuery, but that didn't get me too far.
	>
	> Any suggestions are welcome.
	>
	> Thanks.
	>
	>   Mike
	>
	
	
	--
	- Mark
	
	http://www.lucidimagination.com
	
	
	
	
	
	



-- 
- Mark

http://www.lucidimagination.com





