lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Question on highlighting of nested SpanQuery instances
Date Fri, 26 Feb 2010 15:41:49 GMT
Yeah, by all means open a JIRA issue. If you can get the old tests to 
pass as well as your new test, that would be fantastic.

On 02/26/2010 10:32 AM, Goddard, Michael J. wrote:
>
> Mark,
>
> After making some changes to a few classes,
>
> M      src/java/org/apache/lucene/search/spans/TermSpans.java
> M      contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
> M      contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
> M      contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTerm.java
>
> the JUnit test below passes.  I'm seeing some issues with other tests 
> which I'll have to take care of, and I'm not yet sure how I'll deal 
> with Spans instances (as opposed to NearSpansOrdered, 
> NearSpansUnordered, and TermSpans), since it's an abstract class and I 
> can't call getSubSpans() on that.  I was thinking I ought to open a 
> Jira issue for this, attach the current patch, and just keep working.  
> Does this sound like something other users might find useful?
>
>   Mike
>
>
> -----Original Message-----
> From: java-dev-return-46947-MICHAEL.J.GODDARD=saic.com@lucene.apache.org on behalf of Mark Miller
> Sent: Mon 2/22/2010 3:41 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Question on highlighting of nested SpanQuery instances
>
> I played with it some time back, but I don't have any code left from
> that exercise.
>
> It's fairly tricky.
>
> Take your example:
>
> >   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
> >     new SpanTermQuery(new Term(fieldName, "lucene")),
> >     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
> >
> >   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
> >     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>
> First you see the top level SpanNearQuery -
>
> you want to recurse in and just work with the "lucene within 5 of doug,
> ordered" part. But you can't actually work with that alone. That whole
> span also has to be within 4 of hadoop, ordered ... so how do you
> constrain the sub highlighting? Let's say you do it somehow.
>
> Now you recurse in and want to highlight hadoop - but again, not every
> hadoop - only the hadoops that are within 4, ordered, of the first Span.
>
> So that's really the issue - you want to break up the Span and 
> highlight recursively - but you can't really break them up and 
> maintain all of the positional restrictions required.
>
> So another possible option that gets a little messier might be:
>
> when extracting the allowable positions for a term (which it does by 
> checking the start and end of span), you might also run each inner 
> span that contains that term, and then intersect the positions you 
> find that way with the positions found with the overall span and use 
> that list as the allowable positions. That could get kind of
> complicated though, especially taking into account the logic of the
> SpanOr and SpanNot queries.
>
> - Mark
>
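The intersection idea described above can be sketched in plain Java, without any Lucene classes. This is only an illustrative model, not the highlighter's actual code: the class name, the method names, and the hard-coded span bounds are all assumptions, and token positions come from simple lowercased whitespace splitting (no analyzer, no stop-word removal). For the problem text, the inner "lucene ... doug" span is assumed to cover positions [1, 6) and the outer span [1, 11).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of "intersect inner-span positions with outer-span positions".
class PositionIntersectionSketch {

    /** Positions of term inside the half-open span [start, end). */
    static TreeSet<Integer> positionsInsideSpan(String[] tokens, String term, int start, int end) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (int pos = Math.max(start, 0); pos < Math.min(end, tokens.length); pos++) {
            if (tokens[pos].equals(term)) {
                hits.add(pos);
            }
        }
        return hits;
    }

    /**
     * Positions allowed for term: those inside the outer span that are also
     * inside at least one inner span containing the term. If the term has no
     * inner span (it is a direct clause of the outer query), the outer
     * positions are returned unchanged.
     */
    static List<Integer> allowedPositions(String[] tokens, String term,
                                          int[] outerSpan, int[][] innerSpans) {
        TreeSet<Integer> fromOuter = positionsInsideSpan(tokens, term, outerSpan[0], outerSpan[1]);
        if (innerSpans.length == 0) {
            return new ArrayList<>(fromOuter);
        }
        TreeSet<Integer> fromInner = new TreeSet<>();
        for (int[] inner : innerSpans) {
            fromInner.addAll(positionsInsideSpan(tokens, term, inner[0], inner[1]));
        }
        fromOuter.retainAll(fromInner); // set intersection
        return new ArrayList<>(fromOuter);
    }

    public static void main(String[] args) {
        String[] tokens = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"
                .toLowerCase().split("\\s+");
        int[] outer = {1, 11};     // assumed bounds of the outer SpanNear match
        int[][] inner = {{1, 6}};  // assumed bounds of the inner "lucene ... doug" match
        System.out.println(allowedPositions(tokens, "lucene", outer, inner)); // prints [1]
    }
}
```

In this model the second "lucene" (position 8) drops out because no inner span covers it. The sketch does not attempt the SpanOr and SpanNot cases mentioned above, which is where the real complexity would be.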
> On 02/22/2010 03:15 PM, Goddard, Michael J. wrote:
>
>         Mark,
>
>         Thanks a lot for the insight.  I'm working with this today,
>         diving into the WeightedSpanTermExtractor class and fiddling with
>         it.  If you ever did have any code which attempted to recurse into
>         these structures, I'd be happy to get my hands on it.
>
>         Thanks again.
>
>           Mike
>
>
>
>         -----Original Message-----
>         From: Mark Miller [mailto:markrmiller@gmail.com]
>         Sent: Mon 2/22/2010 9:15 AM
>         To: java-dev@lucene.apache.org
>         Cc: Goddard, Michael J.
>         Subject: Re: Question on highlighting of nested SpanQuery instances
>
>         Hey Michael - this is currently just a limitation of the Span
>         highlighter. It does a bit of fudging when determining what a good
>         position is - if a term from the text is found within the span of a
>         SpanQuery it is in (no matter how deeply nested), the highlighter
>         makes a guess that the term should be highlighted - this is because
>         we don't have the actual positions of each term - just the positions
>         of the start and end of the span. In almost all cases this works as
>         you would expect - but when nesting spans like this, you can get
>         spurious results within the overall span.
>
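To make the "start and end of the span" approximation concrete, here is a minimal plain-Java sketch. It does not use Lucene: the class and method names are hypothetical, tokens are lowercased whitespace splits (so positions may differ from what a real analyzer produces), and the outer span bounds [1, 11) are an assumption worked out by hand for the problem text from the original question (first "lucene" at 1, "doug" at 5, "hadoop" at 10).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "highlight any query term found between span start and end".
class SpanBoundsSketch {

    /** Lowercased whitespace tokens; a token's array index is its position. */
    static String[] tokenize(String text) {
        return text.toLowerCase().split("\\s+");
    }

    /** Positions of term that fall inside the half-open span [spanStart, spanEnd). */
    static List<Integer> positionsInsideSpan(String[] tokens, String term,
                                             int spanStart, int spanEnd) {
        List<Integer> hits = new ArrayList<>();
        for (int pos = Math.max(spanStart, 0); pos < Math.min(spanEnd, tokens.length); pos++) {
            if (tokens[pos].equals(term)) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize(
                "The Lucene was made by Doug Cutting and Lucene great Hadoop was");
        // Only the outer span's bounds are known, so every "lucene" inside
        // them is accepted, including the one at position 8.
        System.out.println(positionsInsideSpan(tokens, "lucene", 1, 11)); // prints [1, 8]
    }
}
```

Position 8 is the spurious hit: it lies inside the overall span but is not the "lucene" that actually participates in the inner near match.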
>         So your idea that we should recurse into the Span is on the right
>         track - but it just gets fairly complicated quickly. Consider
>         SpanNear(SpanNear(mark, miller, 3), SpanTerm(lucene), 4) - if we
>         recurse in and grab the first SpanNear (mark, miller, 3), we can
>         correctly highlight that - but then we will handle lucene by itself -
>         so all lucene terms will be hit rather than the one within 4 of the
>         first span. So you have to deal with SpanOr, SpanNear, SpanNot
>         recursively, but then also handle when they are linked, either with
>         each other or with a SpanTerm - and uh - it gets hard real fast.
>         Hence the fuzziness that goes on now.
>
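The pitfall of recursing clause by clause can be shown with a toy query tree in plain Java. Nothing here is Lucene API: TermNode, NearNode, and collectNaively are hypothetical names, the sample sentence is made up, and the traversal deliberately ignores slop and ordering, which is exactly the mistake being illustrated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of SpanNear(SpanNear(mark, miller, 3), SpanTerm(lucene), 4).
class NaiveRecursionSketch {

    interface Node {}

    static final class TermNode implements Node {
        final String term;
        TermNode(String term) { this.term = term; }
    }

    static final class NearNode implements Node {
        final int slop;
        final Node[] clauses;
        NearNode(int slop, Node... clauses) {
            this.slop = slop;
            this.clauses = clauses;
        }
    }

    /** Visits each clause on its own; the slop between sibling clauses is never checked. */
    static void collectNaively(Node node, String[] tokens, List<Integer> out) {
        if (node instanceof TermNode) {
            String term = ((TermNode) node).term;
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].equals(term)) {
                    out.add(pos); // every occurrence, wherever it is
                }
            }
        } else if (node instanceof NearNode) {
            for (Node clause : ((NearNode) node).clauses) {
                collectNaively(clause, tokens, out);
            }
        }
    }

    static Node exampleQuery() {
        return new NearNode(4,
                new NearNode(3, new TermNode("mark"), new TermNode("miller")),
                new TermNode("lucene"));
    }

    public static void main(String[] args) {
        String[] tokens =
                "mark miller wrote lucene and then years later lucene grew".split("\\s+");
        List<Integer> positions = new ArrayList<>();
        collectNaively(exampleQuery(), tokens, positions);
        System.out.println(positions); // prints [0, 1, 3, 8]
    }
}
```

Position 8 is the second "lucene", which is more than 4 positions past the end of the "mark miller" span and so should not be highlighted if the query is read as ordered. Once the tree is split into independent clauses, the information needed to reject it is gone.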
>         There may be something we can do to improve things in the future,
>         but it's kind of an accepted limitation at the moment - probably
>         something we should add some doc about.
>
>         - Mark
>
>         Goddard, Michael J. wrote:
> >
> > Hello,
> >
> > I initially posted a version of this question to java-user, but think
> > it's more of a java-dev question.  I haven't yet been able to resolve
> > why I'm seeing spurious highlighting in nested SpanQuery instances.
> > To illustrate this, I added the code below to the HighlighterTest
> > class in lucene_2_9_1:
> >
> > /*
> >  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
> >  */
> > public void testHighlightingNestedSpans2() throws Exception {
> >
> >   // Problem case: a second "Lucene" sits between "Doug" and "Hadoop".
> >   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
> >   // Works okay:
> >   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was";
> >
> >   String fieldName = "SOME_FIELD_NAME";
> >
> >   // Inner span: "lucene" followed by "doug" within a slop of 5, in order.
> >   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
> >     new SpanTermQuery(new Term(fieldName, "lucene")),
> >     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
> >
> >   // Outer span: the inner span followed by "hadoop" within a slop of 4, in order.
> >   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
> >     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
> >
> >   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
> >   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
> >
> >   String observed = highlightField(query, fieldName, theText);
> >   System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
> >
> >   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
> > }
> >
> > Is this an issue that's arisen before?  I've been reading through the
> > source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor,
> > Spans, and NearSpansOrdered, but haven't found the solution yet.
> > Initially, I thought that the extractWeightedSpanTerms method in
> > WeightedSpanTermExtractor should be called on each clause of a
> > SpanNearQuery or SpanOrQuery, but that didn't get me too far.
> >
> > Any suggestions are welcome.
> >
> > Thanks.
> >
> >   Mike
> >
>
>
>         --
>         - Mark
>
> http://www.lucidimagination.com
>
> --
> - Mark
>
> http://www.lucidimagination.com


-- 
- Mark

http://www.lucidimagination.com



