lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Rewrite for RegexpQuery
Date Mon, 11 Mar 2013 17:41:04 GMT
I think we have several different problems here:

Carsten just wants to collect the terms an MTQ (MultiTermQuery) visits, so using a BooleanQuery
for this is fine, unless you hit the clause limit. If you don’t execute the query, the limit can
be set as high as you like (but it is a static limit that affects all instances). To do the same,
you can take another approach: implement your own TermCollectingRewrite subclass that simply adds
the collected terms to a custom HashSet or whatever. You just have to implement the addClause and
getTopLevelQuery methods in TermCollectingRewrite and return the set later (just use a "fake"
query as holder for the HashSet). I did something similar in the past to implement a MultiPhraseQuery
with MTQs like wildcards, regexes or fuzzies as clauses (I hope I can donate it soon). The
custom rewrite would be the most efficient way to get the list of terms (if you rely on a
query as input).
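
To illustrate the shape of that idea, here is a minimal, Lucene-free sketch in plain Java. Every
name in it (TermCollectingSketch, TermHolder, CollectingRewrite, SetCollectingRewrite) is
hypothetical and only mimics the two callbacks mentioned above; the real TermCollectingRewrite
methods have different signatures, so check the Javadoc of your Lucene version before porting it.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class TermCollectingSketch {

    /** Hypothetical stand-in for the "fake" query: it only carries the collected terms. */
    static final class TermHolder {
        final Set<String> terms = new LinkedHashSet<>();
    }

    /** Hypothetical stand-in for a rewrite that is called back once per visited term. */
    interface CollectingRewrite<Q> {
        Q getTopLevelQuery();

        void addClause(Q topLevel, String term);
    }

    /** Records every visited term in a set; no clause limit is involved. */
    static final class SetCollectingRewrite implements CollectingRewrite<TermHolder> {
        @Override
        public TermHolder getTopLevelQuery() {
            return new TermHolder();
        }

        @Override
        public void addClause(TermHolder topLevel, String term) {
            topLevel.terms.add(term);
        }
    }

    /** Drives the callbacks over the terms a multi-term query would visit. */
    static <Q> Q rewrite(CollectingRewrite<Q> method, Iterable<String> visitedTerms) {
        Q topLevel = method.getTopLevelQuery();
        for (String term : visitedTerms) {
            method.addClause(topLevel, term);
        }
        return topLevel;
    }

    public static void main(String[] args) {
        TermHolder holder =
                rewrite(new SetCollectingRewrite(), Arrays.asList("lucene", "lucid", "lucene"));
        System.out.println(holder.terms);
    }
}
```

The holder is never executed as a query; it exists only so the rewrite has something to return,
and the caller reads the set back from it afterwards.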

On the other hand, to collect all terms for a wildcard, don’t use the Query at all. Just
wrap the reader's TermsEnum using one of the classes from the search package, like AutomatonTermsEnum
(which takes a regex in its ctor) and filters all terms in the index according to the automaton
(which may be a regex).
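
As a rough picture of what such a filtering enum does, the following plain-Java sketch walks a
sorted term dictionary and keeps the terms a pattern accepts. It is not Lucene code: the class
name TermFilterSketch is hypothetical, a SortedSet stands in for the reader's terms, and
java.util.regex.Pattern stands in for the automaton. A real automaton-based enum can also skip
ahead past ranges that cannot match, which this linear scan does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

class TermFilterSketch {

    /** Returns, in dictionary order, every term that the pattern matches in full. */
    static List<String> matchingTerms(SortedSet<String> termDictionary, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        for (String term : termDictionary) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the index's term dictionary (assumed data, for illustration only).
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("lucene", "lucid", "luck", "solr"));
        System.out.println(matchingTerms(terms, Pattern.compile("luc.*")));
    }
}
```

No query object and no clause limit are involved here; the result is simply the list of
matching terms.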

Finally, if you actually want to execute the query, using a scoring rewrite is a bad idea,
and a limit of 1024 clauses is already too large.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Monday, March 11, 2013 6:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Rewrite for RegexpQuery
> 
> On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober <schnober@ids-mannheim.de> wrote:
> > On 11.03.2013 13:38, Michael McCandless wrote:
> >> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >>
> >>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE,
> >>> then this should work (after rewrite your query is a BooleanQuery, which
> >>> supports extractTerms()).
> >>
> >> ... as long as you don't exceed the max number of terms allowed by BQ
> >> (1024 by default, but you can raise it).
> >
> > True, I've noticed this in the meantime. Are there any recommendations for
> > this setting, so that the limit is as large as possible while performance
> > stays reasonable? Of course, this is highly subjective, but what's the
> > magnitude here? Will a limit of 1,024,000 typically increase the query
> > time by a factor of 1,000 too?
> > Carsten
> 
> I think 1024 may already be too high ;)
> 
> But really it depends on your situation: test different limits and see.
> 
> How much slower a larger query is depends on the specifics of the terms ...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

