lucene-dev mailing list archives

From Dave Kor <s0454...@sms.ed.ac.uk>
Subject Re: Unexpected: ordered
Date Tue, 05 Jul 2005 12:35:15 GMT
Quoting Paul Elschot <paul.elschot@xs4all.nl>:

> On Monday 04 July 2005 22:51, Dave Kor wrote:
> > > I had another look at the code, and my guess now is that this is
> > > related to the spanNear with the single argument.
> > >
> > > It rings some bells. One of them is that I would have preferred
> > > to split the SpanNear class into ordered/unordered after the fix,
> > > but that I gave up because it would take too much time.
> > > The current SpanNear class is too complex for easy maintenance.
> > >
> > > Perhaps the quick fix is to verify in the constructor of SpanNearQuery
> > > that the number of clauses is at least 2, and to throw an illegal arg
> > > exception otherwise.
> >
> > Alright, I'll add code to ensure that I do not generate SpanNearQueries that
> > contain only a single sub-query and see what happens. I hope this solves my
> > problem!
> >
> > Earlier, I went back to have a more in-depth look at the queries that were
> > throwing these exceptions. My system, an experimental query expansion module,
> > had generated 900+ queries and out of those, 50-60 queries cause the RTE.
> >
> > From these queries, I can find many repeated multi-term SpanNearQueries that
> > also throw the same RTE. Here are some examples where the brackets show how
> > the terms are grouped in a SpanNearQuery:
> >
> > ((the (regent hotel)) (the (regent hotel) to))
> > (((elton john)) ((elton john) and))
> > (((the who) is) ((the who) of))
> > ((is) (the (the band nirvana) band))
> > (((united states)) (united states president is the))
> > (((academy awards) of) ((academy awards) is))
>
> In all these cases overlap between two matches can occur because they have
> an equal subquery. The conclusion is that the current span code is not capable
> of handling such cases. It probably chokes at the moment the matches for
> such subqueries concur.

I'm not quite sure what you mean here by "an equal subquery". I am not trying to
get two subqueries to match the same portion of a document. Instead, I am
looking for a repeat of the same search term(s) somewhere farther in the
document.

> The question is whether you would consider such a concurrence to be a match
> for the query.
> If so, the fix might be to return true instead of throwing the exception.

I have simplified the above examples by substituting the original search terms
with more intelligible terms, which unfortunately made the above queries seem
pointless. In reality, my system is trying to search for sentences that conform
to certain linguistic structures.

An example of a useful search is a comma followed by another comma several words
later, followed by the phrase "academy award winner". In other words:

(, (, (academy award winner)~2)~3)~8

This search would pick up only sentences like "Dafoe , who played the role of
Jesus in The Last Temptation of Christ , is also an Academy Award winner for
his ... "
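To make the intended semantics concrete, here is a minimal brute-force sketch in plain Java. It is not Lucene code and does not reproduce Lucene's span algorithm; the class NearSketch and its methods are hypothetical names, and it assumes slop is counted as the total number of positions left between consecutive sub-matches. It runs the nested query on a short made-up token list, where the outer comma clause and the nested comma clause have to match at two different positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Brute-force illustration of nested ordered "near" matching; not Lucene code. */
final class NearSketch {

    /** Every position of word in tokens, as a half-open span {start, end}. */
    static List<int[]> term(List<String> tokens, String word) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equalsIgnoreCase(word)) {
                out.add(new int[] {i, i + 1});
            }
        }
        return out;
    }

    /**
     * Spans in which the clauses match in the given order without overlapping,
     * and the gaps between consecutive clauses add up to at most slop.
     * Assumption: this gap counting is a simplification chosen for the sketch.
     */
    @SafeVarargs
    static List<int[]> orderedNear(int slop, List<int[]>... clauses) {
        List<int[]> out = new ArrayList<>();
        for (int[] first : clauses[0]) {
            extend(clauses, 1, first[0], first[1], slop, out);
        }
        return out;
    }

    private static void extend(List<int[]>[] clauses, int next, int start, int end,
                               int slopLeft, List<int[]> out) {
        if (next == clauses.length) {          // every clause has been placed
            out.add(new int[] {start, end});
            return;
        }
        for (int[] s : clauses[next]) {
            int gap = s[0] - end;              // negative gap means overlap or wrong order
            if (gap >= 0 && gap <= slopLeft) {
                extend(clauses, next + 1, start, s[1], slopLeft - gap, out);
            }
        }
    }

    /** The nested query (, (, (academy award winner)~2)~3)~8 over a token list. */
    static List<int[]> commaCommaAwardWinner(List<String> tokens) {
        List<int[]> comma = term(tokens, ",");
        List<int[]> phrase = orderedNear(2, term(tokens, "academy"),
                term(tokens, "award"), term(tokens, "winner"));
        List<int[]> inner = orderedNear(3, comma, phrase);
        return orderedNear(8, comma, inner);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "a", ",", "b", ",", "x", "academy", "award", "winner");
        for (int[] span : commaCommaAwardWinner(tokens)) {
            System.out.println("match [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

On this token list the sketch reports a single match covering positions 1 to 8: the outer comma matches at position 1 and the nested comma at position 3, so the two equal sub-queries never have to match the same portion of the text.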

Hopefully, this explains what I am trying to achieve with Lucene and why I need
to match repeated sub-queries. I would really appreciate it if anyone has a
solution or a quick fix, or can guide me in hacking up something workable.
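As a first step toward the workaround quoted earlier in this message (never generating a SpanNearQuery with a single sub-query), the following is a rough sketch of how a query tree could be cleaned up before any Lucene query objects are built. QueryNode is a hypothetical stand-in type, not a Lucene class, and the sketch only removes single-clause groups such as ((elton john)) or (is) from the examples above; it does not address the overlap between equal sub-queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical query tree node: either a leaf term or a bracketed group. */
final class QueryNode {
    final String term;               // non-null for a leaf
    final List<QueryNode> children;  // non-empty for a group

    private QueryNode(String term, List<QueryNode> children) {
        this.term = term;
        this.children = children;
    }

    static QueryNode term(String t) {
        return new QueryNode(t, Collections.<QueryNode>emptyList());
    }

    static QueryNode group(QueryNode... kids) {
        return new QueryNode(null, Arrays.asList(kids));
    }

    /** Returns an equivalent tree in which no group has exactly one child. */
    static QueryNode collapse(QueryNode n) {
        if (n.term != null) {
            return n;                        // leaf terms are kept as they are
        }
        List<QueryNode> kids = new ArrayList<>();
        for (QueryNode c : n.children) {
            kids.add(collapse(c));
        }
        if (kids.size() == 1) {
            return kids.get(0);              // unwrap the single-clause group
        }
        return new QueryNode(null, kids);
    }

    /** Prints the tree in the bracket notation used in this thread. */
    @Override
    public String toString() {
        if (term != null) {
            return term;
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(children.get(i));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // (((elton john)) ((elton john) and))
        QueryNode q = group(
                group(group(term("elton"), term("john"))),
                group(group(term("elton"), term("john")), term("and")));
        System.out.println(collapse(q));
    }
}
```

Each remaining group with two or more children would then map to one SpanNearQuery, and each leaf to a term query, so a single-clause SpanNearQuery is never constructed.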


