lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Rewrite for RegexpQuery
Date Mon, 11 Mar 2013 17:48:51 GMT
If you are interested, here is the solution that uses a "fake" query as the rewrite result.
Just use GetTermsRewrite as the rewrite method. The MTQ (MultiTermQuery) then rewrites to a
TermHolderQuery (cast the result to that type), and you can get the terms using getTerms():

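  import java.util.Collections;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoringRewrite;

  /** A fake query that is just used to collect all term instances for the {@link ScoringRewrite} API. */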
  final class TermHolderQuery extends Query {
    private final HashSet<Term> terms = new HashSet<Term>();

    @Override
    public String toString(String defaultField) {
      return getClass().getSimpleName() + terms;
    }
    
    void add(Term term) {
      terms.add(term);
    }
    
    Set<Term> getTerms() {
      return Collections.unmodifiableSet(terms);
    }
  }
  
  final class GetTermsRewrite extends ScoringRewrite<TermHolderQuery> {
    @Override
    protected void addClause(TermHolderQuery topLevel, Term term, float boost) {
      topLevel.add(term);
    }

    @Override
    protected TermHolderQuery getTopLevelQuery() {
      return new TermHolderQuery();
    }
  }
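
To make the callback flow concrete, here is a minimal, self-contained sketch of the same pattern in plain Java. It is not Lucene code: CollectingRewriteSketch, TermHolderSketch, GetTermsRewriteSketch and GetTermsDemo are hypothetical stand-ins, terms are plain strings, the "index" is an in-memory sorted set, and java.util.regex takes the place of the automaton-based term matching. It only illustrates how a collecting rewrite might create the holder once and call addClause for each matching term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a term-collecting rewrite; NOT a Lucene class. */
abstract class CollectingRewriteSketch<Q> {
  /** Creates the empty holder that will receive the terms. */
  protected abstract Q getTopLevelQuery();

  /** Called once for every term accepted by the pattern. */
  protected abstract void addClause(Q topLevel, String term, float boost);

  /** Walks the sorted term dictionary and passes each full match to addClause. */
  final Q rewrite(SortedSet<String> termDictionary, Pattern pattern) {
    Q topLevel = getTopLevelQuery();
    for (String term : termDictionary) {
      if (pattern.matcher(term).matches()) {
        addClause(topLevel, term, 1.0f); // boost is ignored by the collector below
      }
    }
    return topLevel;
  }
}

/** Plays the role of the "fake" query: it only stores the collected terms. */
final class TermHolderSketch {
  private final Set<String> terms = new TreeSet<String>();

  void add(String term) {
    terms.add(term);
  }

  Set<String> getTerms() {
    return Collections.unmodifiableSet(terms);
  }
}

/** Same shape as GetTermsRewrite, written against the stand-in base class. */
final class GetTermsRewriteSketch extends CollectingRewriteSketch<TermHolderSketch> {
  @Override
  protected TermHolderSketch getTopLevelQuery() {
    return new TermHolderSketch();
  }

  @Override
  protected void addClause(TermHolderSketch topLevel, String term, float boost) {
    topLevel.add(term);
  }
}

final class GetTermsDemo {
  /** Returns every dictionary term that fully matches the regex. */
  static Set<String> collect(SortedSet<String> termDictionary, String regex) {
    return new GetTermsRewriteSketch().rewrite(termDictionary, Pattern.compile(regex)).getTerms();
  }

  public static void main(String[] args) {
    SortedSet<String> dict = new TreeSet<String>(Arrays.asList("solr", "lucid", "lucene"));
    System.out.println(collect(dict, "luc.*"));
  }
}
```

With Lucene itself, the corresponding step is the one described above: set GetTermsRewrite as the rewrite method on the MultiTermQuery, rewrite it, cast the result to TermHolderQuery and call getTerms().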
  

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Monday, March 11, 2013 6:41 PM
> To: java-user@lucene.apache.org
> Subject: RE: Rewrite for RegexpQuery
> 
> I think we have different problems here:
> 
> Carsten wants to just collect the terms a MTQ visits, so using BooleanQuery
> to do this is fine, unless you hit the limit. If you don’t execute the query, the
> limit can be as high as possible (but it’s a static limit affecting all instances). To
> do the same you can use another approach: Implement your own
> TermCollectingRewrite subclass that simply adds the collected terms to a
> custom HashSet or whatever. You just have to implement the addClause and
> getTopLevelQuery methods in TermCollectingRewrite and return the set
> later (just use a "fake" query as holder for the HashSet). I did something
> similar in the past to implement a MultiPhraseQuery with MTQs like
> wildcards, regexes or fuzzies as clauses (I hope I can donate it soon). The
> custom rewrite would be the most efficient way to get the list of terms (if
> you rely on a query as input).
> 
> On the other hand, to collect all terms for a wildcard, don’t use the Query at
> all; just wrap the reader's TermsEnum using one of the classes from the
> search package, like AutomatonTermsEnum (which takes a regex in its ctor).
> It filters all terms in the index according to the automaton (which may be a
> regex).
> 
> Finally, if you actually want to execute the query, using a scoring rewrite is a
> bad idea and 1024 is too large, too.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Michael McCandless [mailto:lucene@mikemccandless.com]
> > Sent: Monday, March 11, 2013 6:23 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Rewrite for RegexpQuery
> >
> > On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober
> > <schnober@ids-mannheim.de> wrote:
> > > On 11.03.2013 13:38, Michael McCandless wrote:
> > >> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > >>
> > >>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE,
> > >>> then this should work (after rewrite your query is a BooleanQuery,
> > >>> which supports extractTerms()).
> > >>
> > >> ... as long as you don't exceed the max number of terms allowed by
> > >> BQ (1024 by default, but you can raise it).
> > >
> > > True, I've noticed this in the meantime. Are there any recommendations
> > > for this setting, so that the limit is as large as possible while
> > > performance stays reasonable? Of course, this is highly
> > > subjective, but what's the magnitude here? Will a limit of 1,024,000
> > > typically increase the query time by a factor of 1,000, too?
> > > Carsten
> >
> > I think 1024 may already be too high ;)
> >
> > But really it depends on your situation: test different limits and see.
> >
> > How much slower a larger query is depends on the specifics of the terms ...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

