lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Uable to extends TopTermsRewrite in Lucene 4.1
Date Thu, 04 Apr 2013 10:46:57 GMT
On 04/04/2013 10:59, Paul Taylor wrote:
> On 27/02/2013 10:28, Uwe Schindler wrote:
>> Hi Paul,
>>
>> QueryParser and MTQ's rewrite method have nothing to do with each
>> other. The rewrite method is (explained as simply as possible) a
>> class that is responsible for rewriting a MultiTermQuery to another
>> query type (generally a query to which "Term" instances can be added,
>> e.g. a BooleanQuery of TermQuery clauses or a DisjunctionMaxQuery of
>> terms). The rewrite method takes the "filtered" terms enum provided
>> by the query and creates a combined query out of it. Lucene ships
>> with some already implemented rewrite methods based on abstract
>> classes that handle the most common cases:
>>
>> - ScoringRewrite handles the case where you want to collect the terms
>> from the termsenum and place them as "clauses" in a top-level query
>> (e.g. a scoring BooleanQuery). You have to implement 2 abstract
>> methods: one that produces the top-level query and one that creates
>> the clauses that are added to it. This class is generic in the
>> top-level query type, as the clauses can only be added to the correct
>> top-level query. To make this work without casting, all methods are
>> declared in terms of the generic type, so addClause() takes the
>> generic top-level query and a term. The rewrite method itself
>> returns the top-level query.
>> - TopTermsRewrite is similar, with one major difference: it has
>> almost the same API, but its internal implementation is different.
>> It never hits the Boolean Max Clause Count, because the collected
>> terms are ordered in a priority queue and only the top-ranking terms
>> are added to the resulting top-level query. This class is also
>> generic in the top-level query type. Rewrite returns an instance of
>> the top-level query.
>> - The very base class MultiTermQuery.RewriteMethod is the most
>> flexible but has no concrete implementation. It is used to rewrite a
>> MTQ to a query that is not a composite top-level one with a number of
>> terms, e.g. a filter that's handled in a totally different stage of
>> rewriting.
>>
>> You can use the same MTQ rewrite for different MTQ types, e.g. you
>> can rewrite a FuzzyQuery to a simple ConstantScoreQuery or a
>> DisjunctionMaxQuery - but only the second one makes sense. On the
>> other hand it makes no sense to rewrite Prefix and Wildcard using
>> TopTermsRewrite, as those queries have terms enums without term
>> boosts (only Fuzzy assigns a boost to every term depending on
>> Levenshtein distance).
>>
>> Things to note:
>> A rewrite method in MTQ would never rewrite to another MTQ like
>> PrefixQuery - it could do this, but only in the lowest base class
>> (see above)! -> If you rely on that, your code has a major problem.
>> In that case the correct behavior would be to create your own
>> oal.search.Query (one that does not extend MTQ) and implement a
>> standard rewrite logic. This query could of course rewrite to MTQs
>> like Fuzzy or Prefix. IndexSearcher rewrites the query until it is
>> completely rewritten, so your custom query would create a PrefixQuery
>> which itself rewrites to something else.
>>
>> QueryParser is just a factory for queries; it's not related to MTQ.
>> It only has an option to set a "default" rewrite method for common
>> queries. But as you have a custom QueryParser, you can return the
>> queries, configured like you want, to the caller.
>>
>> Uwe
>>
> Hi Uwe
>
> Okay, think I have it now. Now have a working rewrite method for Fuzzy 
> Queries
>
>     public static class FuzzyTermRewrite<Q extends DisjunctionMaxQuery>
>             extends TopTermsRewrite<Query> {
>
>         public FuzzyTermRewrite(int size) {
>             super(size);
>         }
>
>         @Override
>         protected int getMaxSize() {
>             return BooleanQuery.getMaxClauseCount();
>         }
>
>         @Override
>         protected DisjunctionMaxQuery getTopLevelQuery() {
>             return new DisjunctionMaxQuery(0.1f);
>         }
>
>         @Override
>         protected void addClause(Query topLevel, Term term, int docCount,
>                                  float boost, TermContext states) {
>             final Query tq = new ConstantScoreQuery(new TermQuery(term, states));
>             tq.setBoost(boost);
>             ((DisjunctionMaxQuery) topLevel).add(tq);
>         }
>     }
>
> and now writing a separate class for Prefix Queries so it does 
> actually modify the idf
>
> Paul
>

and this is my prefix rewrite method:

    /**
     * Prefix matches are rewritten to a DisjunctionMaxQuery instead of the
     * more usual BooleanQuery so that if a search term matches multiple
     * fields we just take the best field rather than summing all matches
     * like a boolean query. The 0.1 tiebreaker is to favour documents that
     * contain all words rather than the same word in multiple fields.
     *
     * We set the idf the same as an exact match so that a wildcard match to
     * a term which happens to be rarer than the exact term we were searching
     * for does not get an unfairly high idf.
     */
    public static class PrefixTermRewrite extends MultiTermQuery.RewriteMethod {

        private TFIDFSimilarity  similarity;
        private FuzzyTermRewrite rewrite;

        public PrefixTermRewrite(int size) {
            this.rewrite    = new FuzzyTermRewrite(size);
            this.similarity = new DefaultSimilarity();
        }

        protected float getQueryBoost(final IndexReader reader, final MultiTermQuery query)
                throws IOException {
            float idf = 1f;
            float df;
            PrefixQuery fq = (PrefixQuery) query;
            df = reader.docFreq(fq.getPrefix());
            if (df >= 1) {
                // Same as idf value for search term, 0.5 acts as length norm
                idf = (float) Math.pow(similarity.idf((int) df, reader.numDocs()), 2) * 0.5f;
            }
            return idf;
        }

        @Override
        public Query rewrite(final IndexReader reader, final MultiTermQuery query)
                throws IOException {
            DisjunctionMaxQuery dmq = (DisjunctionMaxQuery) rewrite.rewrite(reader, query);
            float idfBoost = getQueryBoost(reader, query);
            Iterator<Query> iterator = dmq.iterator();
            while (iterator.hasNext()) {
                Query next = iterator.next();
                next.setBoost(next.getBoost() * idfBoost);
            }
            return dmq;
        }
    }
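
To make the arithmetic in getQueryBoost easier to check, here is a rough standalone sketch that does not depend on Lucene. It assumes DefaultSimilarity computes idf as 1 + ln(numDocs / (docFreq + 1)); the class name IdfBoostSketch and both method names are made up for illustration only.

```java
final class IdfBoostSketch {

    // Assumption: this mirrors DefaultSimilarity.idf, i.e. 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Same shape as getQueryBoost: idf squared times 0.5 when the prefix
    // itself occurs as a term, otherwise a neutral boost of 1.
    static float queryBoost(int docFreq, int numDocs) {
        if (docFreq < 1) {
            return 1f;
        }
        float idf = idf(docFreq, numDocs);
        return idf * idf * 0.5f;
    }

    public static void main(String[] args) {
        // Prefix not present as an exact term: boost stays neutral.
        System.out.println(queryBoost(0, 1000));
        // Prefix present in 9 of 1000 documents.
        System.out.println(queryBoost(9, 1000));
    }
}
```

Under that assumption, a prefix that is itself a rare exact term produces a larger multiplier than one that is a common exact term, and every expanded clause receives the same multiplier regardless of how rare the expanded term is.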

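The reasoning in the Javadoc above (take the best field instead of summing all fields) can be illustrated with plain numbers. This is only a sketch of the scoring rule as documented for DisjunctionMaxQuery, namely the maximum clause score plus the tie breaker times the remaining clause scores; DisMaxScoreSketch and its methods are hypothetical names, and non-negative scores are assumed.

```java
final class DisMaxScoreSketch {

    // Best clause score plus tieBreaker times the sum of the other clause scores.
    // Assumes all clause scores are non-negative.
    static float disMax(float tieBreaker, float... clauseScores) {
        float max = 0f;
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    // What a plain sum over the clauses (BooleanQuery-style, ignoring coord) would give.
    static float booleanSum(float... clauseScores) {
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The same word matching in two fields with scores 2.0 and 1.5.
        System.out.println(disMax(0.1f, 2.0f, 1.5f));
        System.out.println(booleanSum(2.0f, 1.5f));
    }
}
```

With a tie breaker of 0.1 the second field contributes only a tenth of its score, so a repeated match in another field helps much less than it would under a sum.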

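Uwe's description of TopTermsRewrite (terms ordered in a priority queue, only the top-ranking ones kept) can be sketched without Lucene as a bounded min-heap. This is not Lucene's implementation, just the general technique; TopTermsSketch and ScoredTerm are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopTermsSketch {

    // A candidate term together with the boost assigned by the terms enum.
    static final class ScoredTerm {
        final String text;
        final float boost;

        ScoredTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    // Keeps at most `size` terms with the highest boost, returned best first.
    static List<ScoredTerm> topTerms(List<ScoredTerm> candidates, int size) {
        Comparator<ScoredTerm> byBoost = (a, b) -> Float.compare(a.boost, b.boost);
        // Min-heap: the weakest retained term sits at the head.
        PriorityQueue<ScoredTerm> queue = new PriorityQueue<>(byBoost);
        for (ScoredTerm t : candidates) {
            queue.offer(t);
            if (queue.size() > size) {
                queue.poll(); // evict the current weakest term
            }
        }
        List<ScoredTerm> result = new ArrayList<>(queue);
        result.sort(byBoost.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<ScoredTerm> candidates = new ArrayList<>();
        candidates.add(new ScoredTerm("lucern", 0.5f));
        candidates.add(new ScoredTerm("lucene", 1.0f));
        candidates.add(new ScoredTerm("lucent", 0.8f));
        for (ScoredTerm t : topTerms(candidates, 2)) {
            System.out.println(t.text + " " + t.boost);
        }
    }
}
```

Because the queue never grows past `size`, the number of clauses in the resulting query is bounded no matter how many terms the enum produces, which is why a max clause limit is never hit.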