lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Unexpected: ordered
Date Tue, 05 Jul 2005 18:54:54 GMT
On Tuesday 05 July 2005 14:35, Dave Kor wrote:
> Quoting Paul Elschot <paul.elschot@xs4all.nl>:
> 
> > On Monday 04 July 2005 22:51, Dave Kor wrote:
> > > > I had another look at the code, and my guess now is that this is
> > > > related to the spanNear with the single argument.
> > > >
> > > > It rings some bells. One of them is that I would have preferred
> > > > to split the SpanNear class into ordered/unordered after the fix,
> > > > but that I gave up because it would take too much time.
> > > > The current SpanNear class is too complex for easy maintenance.
> > > >
> > > > Perhaps the quick fix is to verify in the constructor of SpanNearQuery
> > > > that the number of clauses is at least 2, and to throw an illegal arg
> > > > exception otherwise.
> > >
> > > Alright, I'll add code to ensure that I do not generate SpanNearQueries
> > > that contain only a single sub-query and see what happens, I hope this
> > > solves my problem!
> > >
> > > Earlier, I went back to have a more in-depth look at the queries that
> > > were throwing these exceptions. My system, an experimental query
> > > expansion module, had generated over 900+ queries and out of those,
> > > 50-60 queries cause the RTE.
> > >
> > > From these queries, I can find many repeated multi-term SpanNearQueries
> > > that also throws the same RTE. Here are some examples where the bracket
> > > shows how the terms are grouped in a SpanNearQuery:
> > >
> > > ((the (regent hotel)) (the (regent hotel) to))
> > > (((elton john)) ((elton john) and))
> > > (((the who) is) ((the who) of))
> > > ((is) (the (the band nirvana) band))
> > > (((united states)) (united states president is the))
> > > (((academy awards) of) ((academy awards) is))
> >
> > In all these cases overlap between two matches can occur because they have
> > an equal subquery. The conclusion is that the current span code is not
> > capable of handling such cases. It probably chokes at the moment the
> > matches for such subqueries concur.
> 
> I'm not quite sure what you mean here by "an equal subquery". I am not trying to
> get two subqueries to match the same portion of a document. Instead, I am
> looking for a repeat of the same search term(s) somewhere farther in the
> document.

I meant for example
(elton john)
occurring twice above.
 
> > The question is whether you would consider such a concurrence to be a match
> > for the query.
> > If so, the fix might be to return true instead of throwing the exception.
> 
> I have simplified the above examples by substituting the original search terms
> with more intelligible terms, which unfortunately made the above queries seem
> pointless. In reality, my system is trying to search for sentences that conform
> to certain linguistic structures.
> 
> An example of a useful search is a comma followed by another comma several words
> later, followed by the phrase "academy award winner". In other words
> 
> (, (, (academy award winner)~2)~3)~8
> 
> This search would pick up only sentences like "Dafoe , who played the role of
> Jesus in The Last Temptation of Christ , is also an Academy Award winner for
> his ... "
> 
> Hopefully, this explains what I am trying to achieve with Lucene and why I need
> to match repeated sub-queries. I would really appreciate it if anyone has a
> solution, a quickfix or can guide me in hacking up something workable.

So, in an ordered SpanNearQuery, you want repeated subqueries not to match the
same text/tokens, which boils down to non-overlapping matches.

I had another look at NearSpans.java, and I'm afraid there is no quick fix for this
(but I'd like to be proven wrong).
Spans can match ordered/unordered and overlapping/non-overlapping.
Currently there is no parameter for the overlap, and I don't know how
SpanNearQuery behaves with respect to overlapping matches.
There is no special case for equal subqueries, which is probably OK, but
when overlaps are allowed, care should be taken not to use equal subqueries.
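
To make the ordered and non-overlapping case concrete, here is a minimal sketch in plain Java. It is not NearSpans code: TokenSpan and SpanOrderSketch are hypothetical names, and treating a match as the half-open position range [start, end) is an assumption made for this illustration. It only shows the kind of check that would keep two equal subqueries from matching the same tokens.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper type for illustration only; not a Lucene class.
// A match covers token positions start (inclusive) to end (exclusive).
final class TokenSpan {
    final int start;
    final int end;

    TokenSpan(int start, int end) {
        if (start < 0 || end <= start) {
            throw new IllegalArgumentException("expected 0 <= start < end");
        }
        this.start = start;
        this.end = end;
    }
}

final class SpanOrderSketch {
    // Two matches overlap when they share at least one token position.
    static boolean overlaps(TokenSpan a, TokenSpan b) {
        return a.start < b.end && b.start < a.end;
    }

    // Ordered and non-overlapping: each match ends at or before the
    // position where the next one starts.
    static boolean orderedNonOverlapping(List<TokenSpan> matches) {
        for (int i = 1; i < matches.size(); i++) {
            if (matches.get(i - 1).end > matches.get(i).start) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed positions: "elton john" occurs at tokens 3-4 and again at 9-10.
        TokenSpan first = new TokenSpan(3, 5);
        TokenSpan second = new TokenSpan(9, 11);

        // Both equal subqueries landing on the same occurrence: rejected.
        System.out.println(orderedNonOverlapping(Arrays.asList(first, first)));
        // Each subquery on its own occurrence: accepted.
        System.out.println(orderedNonOverlapping(Arrays.asList(first, second)));
    }
}
```

With a check like this, a repeated subquery such as (elton john) would only count when its two matches fall on different occurrences in the text.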

On hacking up something workable: it would be good to get this
bug out of NearSpans.
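
In the meantime, the clause-count check suggested earlier in this thread can be applied on the caller's side before any query is built. The following is only a sketch under assumptions: NearClauseGuard and requireAtLeastTwo are hypothetical names, not part of Lucene, and plain strings stand in for the subqueries.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical caller-side guard; not part of Lucene.
final class NearClauseGuard {
    // Returns the clause list unchanged when it has at least two entries,
    // and throws otherwise, mirroring the check proposed for the constructor.
    static <T> List<T> requireAtLeastTwo(List<T> clauses) {
        if (clauses == null || clauses.size() < 2) {
            int count = (clauses == null) ? 0 : clauses.size();
            throw new IllegalArgumentException(
                "a near query needs at least 2 clauses, got " + count);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Strings stand in for subqueries here.
        System.out.println(requireAtLeastTwo(Arrays.asList("elton", "john")));
        try {
            requireAtLeastTwo(Arrays.asList("is"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This only rules out the single-clause case; it does nothing about overlapping matches of equal subqueries.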

Anyway, to test this, e.g. using the examples you gave above,
TestSpans.java here has some small code examples to start from:
http://svn.apache.org/viewcvs.cgi/lucene/java/tags/lucene_1_4_3/src/test/org/apache/lucene/search/spans/

TestBasics.java there has some larger examples.

Regards,
Paul Elschot

