lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Request for clarification on unordered SpanNearQuery
Date Fri, 05 Mar 2010 20:39:10 GMT
On Friday 05 March 2010 18:12:12, Goddard, Michael J. wrote:
> Paul,
> 
> It seems like elaborating on my goal would be a good place to start.  Our users are accustomed
> to a text engine they've been using for a while, one that supports nesting "span queries"
> arbitrarily deeply with embedded wildcard and range queries, etc.  We would like to switch
> the entire user community over to Lucene, but without impacting their work.  For them, it
> is fairly easy to run a query using the old system, dump the highlighted result set to a file,
> do the same using the Lucene version, and diff those.  When there is a difference, it causes
> concern.
> 
> There are two issues I'm looking at now:
> 
> 1. When the query contains nested SpanNearQuery, why are terms matching the inner query
> also highlighted in the parts of the text which don't match that inner query?  This is what
> I'm looking at now, and Mark Miller has helped me quite a bit.  For this part, most of the
> time the highlighting doesn't have to be super fast since it is being done when a single hit
> is being viewed, so optimizing it for speed isn't a high priority.

I've never looked into the highlighting code of Lucene.


> 
> 2. Why do users have to increase the "slop" values in their queries when they contain
> nested OR and other structures?  I haven't started this part yet, apart from the investigation
> into issues related to #1.

Probably the users have a different understanding of spans than Lucene has.

Many variations of proximity queries over spans are possible.
SpanNearQuery supports only two of them: ordered and unordered.
Chances are that the variations your users are used to differ slightly
from these two, so you might end up reimplementing the ones your users
expect on top of Lucene. If you have test cases for them,
start from those.
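
As a starting point for such test cases, here is a minimal sketch in plain Java
(no Lucene classes) of a brute-force reference that lists every unordered match
in a small document. It assumes whitespace-separated tokens, distinct query
terms, and that the slop of a match is its width minus the number of terms.
The class and method names are made up for illustration, and the approach is
only practical for tiny test documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: brute-force reference for unordered matches.
class UnorderedSpanEnumerator {

    /** Returns the token positions at which the given term occurs. */
    static List<Integer> positionsOf(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    /**
     * Enumerates every distinct {start, end} pair (end exclusive) that covers one
     * occurrence of each term with at most `slop` extra positions.
     * Assumption: slop = (end - start) - number of terms.
     */
    static List<int[]> allMatches(String doc, int slop, String... terms) {
        if (terms.length == 0) {
            return new ArrayList<>();
        }
        String[] tokens = doc.trim().split("\\s+");
        List<List<Integer>> perTerm = new ArrayList<>();
        for (String term : terms) {
            perTerm.add(positionsOf(tokens, term));
        }
        // Sorted by start, then end; the comparator also removes duplicate spans.
        TreeSet<int[]> spans = new TreeSet<>(
                Comparator.<int[]>comparingInt(s -> s[0]).thenComparingInt(s -> s[1]));
        collect(perTerm, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop, spans);
        return new ArrayList<>(spans);
    }

    /** Tries every combination of one position per term, tracking min and max position. */
    private static void collect(List<List<Integer>> perTerm, int index,
                                int min, int max, int slop, TreeSet<int[]> out) {
        if (index == perTerm.size()) {
            int width = max + 1 - min;
            if (width - perTerm.size() <= slop) {
                out.add(new int[] {min, max + 1});
            }
            return;
        }
        for (int pos : perTerm.get(index)) {
            collect(perTerm, index + 1, Math.min(min, pos), Math.max(max, pos), slop, out);
        }
    }

    public static void main(String[] args) {
        for (int[] span : allMatches("t1 t2 t1 t3 t2 t3", 1, "t1", "t2", "t3")) {
            System.out.println("[" + span[0] + "," + span[1] + ")");
        }
    }
}
```

For the document "t1 t2 t1 t3 t2 t3" from the test quoted below, with slop 1,
this lists [0,4), [1,4), [2,5) and [2,6), so under the stated slop assumption
the fourth range is a valid candidate even though the unordered SpanNearQuery
does not report it.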

It could also be useful to have a look at the Surround language in contrib.
That query language allows nested span queries, OR and truncations.
It maps directly to Lucene's span queries.

Regards,
Paul Elschot


> 
> Does any of that make sense?
> 
>   Mike
> 
> 
> -----Original Message-----
> From: java-dev-return-47362-MICHAEL.J.GODDARD=saic.com@lucene.apache.org on behalf of Paul Elschot
> Sent: Thu 3/4/2010 5:14 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Request for clarification on unordered SpanNearQuery
>  
> There can be so many possible unordered matches for a given slop that
> enumerating them all would be too slow for general use.
> 
> Note that I have not been very precise: one should also consider the
> same term indexed in the same position multiple times (not normal,
> but not impossible) and, last but not least, nested SpanNearQueries.
> 
> As Mark said, spans are funny beasts. Before starting these 40
> hours, you could try and discuss design ideas here.
> Could you elaborate on what you need to achieve?
> 
> Regards,
> Paul Elschot
> 
> On Thursday 04 March 2010 21:03:09, Goddard, Michael J. wrote:
> > Paul (and Mark),
> > 
> > Thank you for answering.  Do you suppose "not really straightforward" means "40
> > hours" or something like that?  I'm just trying to get an idea of whether what I'm attempting
> > is worth the effort.
> > 
> >   Mike
> > 
> > 
> > -----Original Message-----
> > From: java-dev-return-47351-MICHAEL.J.GODDARD=saic.com@lucene.apache.org on behalf of Paul Elschot
> > Sent: Thu 3/4/2010 11:51 AM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Request for clarification on unordered SpanNearQuery
> >  
> > Michael,
> > 
> > The test for the 4th range fails because the first matching subspans
> > (for t1 in this case) is always the one that is advanced first. The first
> > match at that point has less slop (0) than the maximum allowed (1),
> > so one might actually try to advance another subspans first.
> > But that is not really straightforward to implement, especially when different
> > terms can be indexed in the same position.
> > 
> > Perhaps the javadocs for the unordered case should be improved to mention
> > that in the unordered case the first subspans is always the one that is
> > advanced first.
> > 
> > Regards,
> > Paul Elschot
> > 
> > On Thursday 04 March 2010 17:34:26, Goddard, Michael J. wrote:
> > > I've been working on some highlighting changes involving Spans (https://issues.apache.org/jira/browse/LUCENE-2287)
> > > and could use some help understanding when overlapping Spans are valid.  To illustrate, I
> > > added the test below to the TestSpans class; this test fails because there is no fourth range.
> > > 
> > > Am I wrong in my expectation that that last range would match?
> > > 
> > > Thanks.
> > > 
> > >   Mike
> > > 
> > > 
> > >   // Doc 11 contains "t1 t2 t1 t3 t2 t3"
> > >   public void testSpanNearUnOrderedOverlap() throws Exception {
> > >     boolean ordered = false;
> > >     int slop = 1;
> > >     SpanNearQuery snq = new SpanNearQuery(
> > >                               new SpanQuery[] {
> > >                                 makeSpanTermQuery("t1"),
> > >                                 makeSpanTermQuery("t2"),
> > >                                 makeSpanTermQuery("t3") },
> > >                               slop,
> > >                               ordered);
> > >     Spans spans = snq.getSpans(searcher.getIndexReader());
> > >     
> > >     assertTrue("first range", spans.next());
> > >     assertEquals("first doc", 11, spans.doc());
> > >     assertEquals("first start", 0, spans.start());
> > >     assertEquals("first end", 4, spans.end());
> > >     
> > >     assertTrue("second range", spans.next());
> > >     assertEquals("second doc", 11, spans.doc());
> > >     assertEquals("second start", 1, spans.start());
> > >     assertEquals("second end", 4, spans.end());
> > >     
> > >     assertTrue("third range", spans.next());
> > >     assertEquals("third doc", 11, spans.doc());
> > >     assertEquals("third start", 2, spans.start());
> > >     assertEquals("third end", 5, spans.end());
> > >     
> > >     // Question: why wouldn't this Span be found?
> > >     assertTrue("fourth range", spans.next());
> > >     assertEquals("fourth doc", 11, spans.doc());
> > >     assertEquals("fourth start", 2, spans.start());
> > >     assertEquals("fourth end", 6, spans.end());
> > >     
> > >     assertFalse("fifth range", spans.next());
> > >   }
> > > 
> > > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> > 
> > 
> > 
> 
> 
> 
> 
