lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: wildcards and spans
Date Sat, 05 Aug 2006 02:33:13 GMT
A thought - would you (or the project lead;-) consider limiting the
'wildcard expansion'?

Assuming a query like:
          ( uni* near(5) science )
I.e. match docs with any word with prefix "uni" that spans no further than
5 from the word "science". Assume current lexicon has M (say 1200) words
starting with "uni", you could (manually) select the first N (say 50) words
from this list. You could also select the 'best' N words - defining 'best'
in correlation with those words DF for instance.

This has a few issues:
- what N value is good enough?
- how to select the N words?
- you would have to interfere with the code that expands the wildcard.
- search results recall may degrade because it might happen that words
selected as 'best' would not pass the 'span test' while some of the words
that were not selected would have passed the span test.

But it might be practical, and perhaps, hopefully, satisfactory.

Regards,
Doron

"Erick Erickson" <erickerickson@gmail.com> wrote on 02/08/2006 11:17:19:

> I'm almost entirely certain that any value I choose for setMaxClauseCount
is
> going to be wrong, but I might give it a try.
>
> Erick
>
> On 8/2/06, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> >
> > On Wednesday 02 August 2006 17:29, Erick Erickson wrote:
> > > I'm back, with another flavor of wildcards. What direction would you
> > point a
> > > poor boy who's project lead wants wildcard queries and spans? Here's
the
> > > problem....
> > >
> > > I cannot use any of the classes that throw a "TooManyClauses"
exception
> > (e.g.
> > > SpanRegexQuery or SpanNearQuery with, say WildCardQuery). The corpus
is
> > big
> > > enough that this is guaranteed to be thrown. So, currently I'm using
a
> > > filter for wildcard queries, populating it via WildcardTermEnum and
> > > TermDocs... Works like a champ. But I don't see how to combine this
with
> > > spans...
> >
> > You can try BooleanQuery.setMaxClauseCount() to  increase the max. nr.
of
> > clauses to 100000 or so and see what happens when searching.
> > With enough RAM it should work nicely.
> >
> > You could also use the surround query language. This allows to set
> > the max. nr. of clauses for a whole query instead of per BooleanQuery.
> >
> > Regards,
> > Paul Elschot
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message