lucene-dev mailing list archives

From Giulio Cesare Solaroli <giulio.ces...@gmail.com>
Subject Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
Date Fri, 12 Nov 2004 14:11:26 GMT
Hi all,

I am cross-posting my reply to the developer list as well, because I
think some of my arguments belong there.

I was thinking about somehow extending the PhraseQuery analyzer to
handle wildcard expansion better.

Sanyi's idea to "optimize" the expansion of the terms, so that it includes
only the ones meaningful for the subset of documents found by the other
parts of the query, is intriguing, but probably very difficult to implement.

My idea would probably be easier to implement; even if the final result
is not 100% exact, it could be good enough. The idea is to let the
developer handle the boolean query limit in one of the following
ways:
- leave the current implementation as it is, raising an exception;
- handle the exception and limit the boolean query to the first 1024
(or whatever the limit is) terms;
- select, among the possible terms, only the 1024 (or whatever the
limit is) most meaningful ones, leaving out all the others.
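
A minimal sketch of the first two options, to make the difference
concrete. It is plain Java over an in-memory sorted set of terms, not
Lucene's API: the class ClauseLimitSketch, the exception
TooManyTermsException and both method names are hypothetical, and the
sorted set only stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Illustrative model only; none of these names are Lucene classes. */
class ClauseLimitSketch {

    /** Hypothetical stand-in for the exception raised past the clause limit. */
    static class TooManyTermsException extends RuntimeException {
        TooManyTermsException(String message) {
            super(message);
        }
    }

    /** Option 1: expand prefix* against the dictionary and fail past the limit. */
    static List<String> expandStrict(SortedSet<String> dictionary, String prefix, int limit) {
        List<String> expanded = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can start with the prefix
            }
            if (expanded.size() == limit) {
                throw new TooManyTermsException(
                        prefix + "* expands to more than " + limit + " terms");
            }
            expanded.add(term);
        }
        return expanded;
    }

    /** Option 2: same expansion, but keep only the first `limit` terms. */
    static List<String> expandTruncated(SortedSet<String> dictionary, String prefix, int limit) {
        List<String> expanded = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix) || expanded.size() == limit) {
                break; // past the prefix range, or limit reached: stop silently
            }
            expanded.add(term);
        }
        return expanded;
    }
}
```

With a TreeSet of terms as the dictionary, expandTruncated keeps the
first matches in dictionary order, so which terms survive depends only on
how they sort and not on how useful they are.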

I had this idea while watching how some terms were expanded against our
index. Many of them were clearly wrong words, filenames, or other kinds
of irrelevant information that was not easy to remove before indexing.

This solution changes the returned results in a subtle way (though only
in cases where the current implementation throws an exception), so the
developer should be very careful to tell her users that the query
could have left out some documents.

How "meaningful" a term is, in this context, could be taken as
proportional to the number of documents in the whole index that contain
that term, as a first approximation.
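
A rough sketch of that selection, assuming the document count for each
candidate term is already available in a map. The class
TopTermsByDocFreq and its select method are hypothetical names, and
nothing here calls Lucene; ties are broken alphabetically only to keep
the result deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative model only; not part of Lucene. */
class TopTermsByDocFreq {

    /**
     * Returns at most `limit` terms, preferring those contained in the
     * largest number of documents.
     */
    static List<String> select(Map<String, Integer> docFreqByTerm, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqByTerm.entrySet());
        entries.sort((a, b) -> {
            // Higher document frequency first, then alphabetical order.
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (selected.size() == limit) {
                break; // everything after this point is dropped from the query
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

Terms that occur in only one or two documents, such as stray filenames,
sort last and are the first to be dropped. Documents that match only
through those dropped terms would no longer be returned.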

Does this idea sound interesting to any of you?

Regards,

Giulio Cesare Solaroli



On Thu, 11 Nov 2004 11:57:32 -0800 (PST), Sanyi <need4sid@yahoo.com> wrote:
> Yes, I understand all of this, but I don't want to set it to MaxInt, since it can easily lead to
> (even accidental) DoS attacks.
> 
> What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024
> variations when I search for "somerareword AND wild*", since somerareword is only present in let's
> say 100 documents, so wild* should only expand to words beginning with "wild" in those 100
> documents, then it should work fine with the default 1024 clause limit.
> 
> But it doesn't, so I can choose between unusable queries or accidental DoS attacks.
> 
> 
> 
> --- Will Allen <wallen@Cyveillance.com> wrote:
> 
> > Any wildcard search will automatically expand your query to the number of terms it finds in the
> > index that suit the wildcard.
> >
> > For example:
> >
> > wild*, would become wild OR wilderness OR wildman etc for each of the terms that exist in your
> > index.
> >
> > It is because of this, that you quickly reach the 1024 limit of clauses.  I automatically set it
> > to max int with the following line:
> >
> > BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
> >
> >
> > -----Original Message-----
> > From: Sanyi [mailto:need4sid@yahoo.com]
> > Sent: Thursday, November 11, 2004 6:46 AM
> > To: lucene-user@jakarta.apache.org
> > Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >
> > Hi!
> >
> > First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024 Clauses
> > limit by default which is good enough for me, but I still think it works strangely.
> >
> > Example:
> > I have an index with about 20 million documents.
> > Let's say that there are about 3000 variants in the entire document set of this word mask: cab*
> > Let's say that about 500 documents contain the word: spectrum
> > Now, when I search for "cab* AND spectrum", I don't expect it to throw an exception.
> > It should first restrict the search to the 500 documents containing the word "spectrum", then
> > it should collect the variants of "cab*" within these documents, which turns out to be two or
> > three variants of "cab*" (cable, cables, maybe some more) and the search should return let's say
> > 10 documents.
> >
> > Similar example: When I search for "cab* AND nonexistingword" it still throws a TooManyClauses
> > exception instead of saying "No results", since there is no "nonexistingword" in my document
> > set, so it doesn't even have to start collecting the variations of "cab*".
> >
> > Is there any path for this issue?
> > Thank you for your time!
> >
> > Sanyi
> > (I'm using: lucene 1.4.2)
> >
> > p.s.: Sorry for re-sending this message, I was first sending it as an accidental reply to a
> > wrong thread..
> >
> >
> >
> > __________________________________
> > Do you Yahoo!?
> > Check out the new Yahoo! Front Page.
> > www.yahoo.com
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

