lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Re: Question: using boost for sorting
Date Thu, 17 Oct 2002 01:23:01 GMT
How about adding sortType to IndexSearcher first?
Users can specify IndexSearcher.sortType (by score: default, by docID, by docID desc) before
searching.
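
The premise behind sorting by docID can be illustrated without Lucene. The sketch below is a minimal stand-in, assuming documents are added in an order already sorted by some field, so that list position plays the role of docID. The class and method names (DocIdAliasSketch, Item, indexSortedByPrice, matchingDocIds) are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DocIdAliasSketch {
    /** Hypothetical document type; not a Lucene class. */
    static final class Item {
        final String title;
        final int price;
        Item(String title, int price) { this.title = title; this.price = price; }
        @Override public String toString() { return title + " (" + price + ")"; }
    }

    /** Sort the source by price, then "index" it: list position stands in for docID. */
    static List<Item> indexSortedByPrice(List<Item> source) {
        List<Item> index = new ArrayList<>(source);
        index.sort(Comparator.comparingInt((Item item) -> item.price));
        return index;
    }

    /** Return the docIDs of items whose title contains the term, in ascending docID order. */
    static List<Integer> matchingDocIds(List<Item> index, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < index.size(); docId++) {
            if (index.get(docId).title.contains(term)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Item> index = indexSortedByPrice(Arrays.asList(
            new Item("red lamp", 30), new Item("blue lamp", 10),
            new Item("red chair", 20), new Item("green lamp", 25)));
        // Hits listed by ascending docID are also listed by ascending price.
        for (int docId : matchingDocIds(index, "lamp")) {
            System.out.println(docId + " " + index.get(docId));
        }
    }
}
```

The alias only holds while docIDs keep reflecting the original insertion order, so it suits data that is bulk-indexed from an already sorted source.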

Che, Dong

diff IndexSearcher.java ~/lucene-1.2-src/src/java/org/apache/lucene/search/IndexSearcher.java

66,81c66
< /**
<  * Implements search over a single IndexReader.
<  *
<  * Users can customize how search results are sorted via <code>sortType</code>:
<  * if the data source is sorted by some field before indexing, docID can be taken
<  * as an alias for that sort field, so sorting search results by docID
<  * (ascending or descending) is equivalent to sorting by that field.
<  *
<  * search results sort method:
<  *  0:  sort by score (default)
<  *  1:  sort by docID
<  *  -1: sort by docID desc
<  *
<  * @author Che, Dong <chedong@bigfoot.com>
<  * $Header: /home/cvsroot/lucene_ext/src/org/apache/lucene/search/IndexSearcher.java,v 1.1.1.1 2002/09/22 19:36:08 chedong Exp $
<  */
---
> /** Implements search over a single IndexReader. */
83,89d67
<   /**
< 
<    */
<   public static final int ORDER_BY_SCORE = 0;
<   public static final int ORDER_BY_DOCID = 1;
<   public static final int ORDER_BY_DOCID_DESC = -1;
<   public int sortType = ORDER_BY_SCORE;
96c74
< 
---
>     
101c79
< 
---
>     
106c84
< 
---
>     
134,162c112,127
<     final int md = reader.maxDoc();
< 
<     scorer.score(new HitCollector()
<       {
<               private float minScore = 0.0f;
<               public final void collect(int doc, float score) {
<                 if (score > 0.0f &&                     // ignore zeroed buckets
<                     (bits==null || bits.get(doc))) {    // skip docs not in bits
<                   totalHits[0]++;
<                   if (score >= minScore) {
<                     // update hit queue
<                     switch (sortType) {
<                           case ORDER_BY_DOCID:   //sort results by docID
<                             hq.put(new ScoreDoc(doc, doc));
<                             break;
<                           case ORDER_BY_DOCID_DESC:  //sort results by docID desc
<                             hq.put(new ScoreDoc(doc, (md - doc) ) );
<                             break;
<                           default:  //ORDER_BY_SCORE: sort results by score (default)
<                             hq.put(new ScoreDoc(doc, score));
<                         }
<                     if (hq.size() > nDocs) {            // if hit queue overfull
<                               hq.pop();                         // remove lowest in hit queue
<                               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
<                     }
<                   }
<                 }
<               }
<       }, md);
---
>     scorer.score(new HitCollector() {
>       private float minScore = 0.0f;
>       public final void collect(int doc, float score) {
>         if (score > 0.0f &&                     // ignore zeroed buckets
>             (bits==null || bits.get(doc))) {    // skip docs not in bits
>           totalHits[0]++;
>           if (score >= minScore) {
>             hq.put(new ScoreDoc(doc, score));   // update hit queue
>             if (hq.size() > nDocs) {            // if hit queue overfull
>               hq.pop();                         // remove lowest in hit queue
>               minScore = ((ScoreDoc)hq.top()).score; // reset minScore
>             }
>           }
>         }
>       }
>       }, reader.maxDoc());
167c132
< 
---
>     
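
One point that may be worth checking in the patch above: the queue is filled with docID-based keys, but the admission test still compares the raw score against minScore, which is read back from the queue. A minimal standalone sketch of the same idea, assuming the key is computed first and then used for both the comparison and the queue entry, could look like this. It uses java.util.PriorityQueue instead of Lucene's HitQueue, and the names SortTypeSketch, Hit, sortKey and topDocs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class SortTypeSketch {
    static final int ORDER_BY_SCORE = 0;
    static final int ORDER_BY_DOCID = 1;
    static final int ORDER_BY_DOCID_DESC = -1;

    /** Hypothetical queue entry: the key the hit is ranked by, plus its docID. */
    static final class Hit {
        final float key;
        final int doc;
        Hit(float key, int doc) { this.key = key; this.doc = doc; }
    }

    /** Map a hit to its ranking key; hits with larger keys are returned first. */
    static float sortKey(int sortType, int doc, float score, int maxDoc) {
        switch (sortType) {
            case ORDER_BY_DOCID:      return doc;
            case ORDER_BY_DOCID_DESC: return maxDoc - doc;
            default:                  return score;   // ORDER_BY_SCORE
        }
    }

    /** Keep the nDocs hits with the largest keys and return their docIDs, largest key first. */
    static List<Integer> topDocs(float[] scores, int nDocs, int sortType) {
        List<Integer> result = new ArrayList<>();
        if (nDocs <= 0) {
            return result;
        }
        // Min-heap on key, so the weakest retained hit sits at the head.
        PriorityQueue<Hit> hq = new PriorityQueue<>((a, b) -> Float.compare(a.key, b.key));
        float minKey = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0.0f) {
                continue;                              // treat zero score as "no match"
            }
            float key = sortKey(sortType, doc, scores[doc], scores.length);
            if (key >= minKey) {                       // compare key with key, not score with key
                hq.add(new Hit(key, doc));
                if (hq.size() > nDocs) {               // queue overfull: drop the weakest hit
                    hq.poll();
                    minKey = hq.peek().key;
                }
            }
        }
        while (!hq.isEmpty()) {
            result.add(hq.poll().doc);                 // ascending key order
        }
        Collections.reverse(result);                   // largest key first
        return result;
    }
}
```

With these keys the largest key comes out first, so in this sketch ORDER_BY_DOCID returns the highest docID first and ORDER_BY_DOCID_DESC the lowest. Whether that matches the intended meaning of the constant names is a design choice the patch would need to settle.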


----- Original Message ----- 
From: "Doug Cutting" <cutting@lucene.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Thursday, October 17, 2002 5:21 AM
Subject: Re: Question: using boost for sorting


> Please submit diffs before committing anything, as this is delicate 
> code.  Small changes here can affect performance in a big way.
> 
> Also, we must be extra-careful when making a new public API: once a 
> method is public it's very hard to remove it.  The Similarity methods 
> also need to be well documented.
> 
> Doug
> 
> Otis Gospodnetic wrote:
> > This sounds good to me, as it would lead us to pluggable similarity
> > computation...mmmm.
> > I can refactor some of this tonight.
> > 
> > Otis
> > 
> > 
> > --- Doug Cutting <cutting@lucene.com> wrote:
> > 
> >>This looks like a good approach.  When I get a chance, I'd like to make
> >>Similarity an interface or an abstract class, whose default
> >>implementation would do what the current class does, but whose methods
> >>can be overridden.  Then I'd add methods like:
> >>
> >>   public static void Similarity.setDefaultSimilarity(Similarity sim);
> >>   public void IndexWriter.setSimilarity(Similarity sim);
> >>   public void Searcher.setSimilarity(Similarity sim);
> >>
> >>So to override Similarity methods you'd define a subclass of the 
> >>standard implementation, then either install yours globally via 
> >>setDefaultSimilarity, or set it in your IndexWriter before adding 
> >>documents and in your Searcher before searching.  Does that sound 
> >>reasonable?
> >>
> >>This would let you do what you describe below without changing Lucene's
> >>sources.  However I'm very short on time right now and don't know how
> >>soon I'll get to this.
> >>
> >>Doug
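
The pluggable design described above (a global default plus a per-instance override) can be sketched in isolation. This is only an illustration under stated assumptions: the classes below are stand-ins, not Lucene's, and the lengthNorm method with its 1/sqrt(numTerms) default is a hypothetical example of an overridable scoring factor. The setDefaultSimilarity and setSimilarity names are taken from the proposal.

```java
public class SimilaritySketch {
    /** Stand-in for an overridable Similarity; not Lucene's class. */
    static class Similarity {
        private static Similarity defaultSimilarity = new Similarity();

        /** Install a replacement globally. */
        static void setDefaultSimilarity(Similarity sim) { defaultSimilarity = sim; }

        static Similarity getDefault() { return defaultSimilarity; }

        /** Hypothetical overridable factor: shorter fields weigh more. */
        float lengthNorm(int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    }

    /** Subclass that ignores field length entirely. */
    static class NoLengthNorm extends Similarity {
        @Override
        float lengthNorm(int numTerms) { return 1.0f; }
    }

    /** Stand-in for a Searcher (or IndexWriter) with a per-instance override. */
    static class Searcher {
        private Similarity similarity;   // null means "use the global default"

        void setSimilarity(Similarity sim) { similarity = sim; }

        Similarity getSimilarity() {
            return similarity != null ? similarity : Similarity.getDefault();
        }
    }

    public static void main(String[] args) {
        Searcher searcher = new Searcher();
        System.out.println(searcher.getSimilarity().lengthNorm(4));   // global default
        searcher.setSimilarity(new NoLengthNorm());
        System.out.println(searcher.getSimilarity().lengthNorm(4));   // per-instance override
    }
}
```

Resolving the default lazily in getSimilarity means a later call to setDefaultSimilarity still affects searchers that never set their own instance.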
> >>
> >>David Birtwell wrote:
> >>
> >>>Hi Dmitry,
> >>>
> >>>I was faced with a similar problem.  We wanted to have a numeric rank
> >>>field in each document influence the order in which the documents were
> >>>returned by lucene.  While investigating a solution for this, I wanted
> >>>to see if I could implement strict sorting based on this numeric value.
> >>>I was able to accomplish this using document boosting, but not without
> >>>modifying the lucene source.  Our "ranking" field is an integer value
> >>>from one to one hundred.  I'm not sure if this will help you, but I'll
> >>>include a summary of what I did.
> >>>
> >>>In DocumentWriter remove the normalization by field length:
> >>>   float norm = fieldBoosts[n] * Similarity.normalizeLength(fieldLengths[n]);
> >>>to
> >>>   float norm = fieldBoosts[n];
> >>>
> >>>In TermScorer and PhraseScorer, modify the score() method to ignore the
> >>>lucene base score:
> >>>   score *= Similarity.decodeNorm(norms[d]);
> >>>to
> >>>   score = Similarity.decodeNorm(norms[d]);
> >>>
> >>>In Similarity.java, make byteToFloat() public.
> >>>
> >>>At index time, use Similarity.byteToFloat() to determine your boost
> >>>value as in the following pseudocode:
> >>>   Document d = new Document();
> >>>   ... add your fields ...
> >>>   int rank = Integer.parseInt(d.get("RANK")); // range of rank can be 0 to 255
> >>>   float sortVal = Similarity.byteToFloat((byte) rank);
> >>>   d.setBoost(sortVal);
> >>>
> >>>If you'd like the reasoning behind any or all of these items, let
> >>>me know.
> >>>
> >>>DaveB
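
A possible reading of why the boost is derived through byteToFloat(): the norm is stored in a single byte, so an arbitrary float boost would be rounded on the way in, and nearby ranks could collapse into the same stored value. Choosing boosts that are exactly representable keeps every rank distinct and ordered. The sketch below assumes an 8-bit float layout with a 3-bit mantissa and a 5-bit exponent. That layout, and the class name NormByteSketch, are assumptions made for illustration and have not been checked against the Lucene source.

```java
public class NormByteSketch {
    /** Decode an 8-bit float: low 3 bits are the mantissa, high 5 bits the exponent (assumed layout). */
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    /** Encode a float into the same 8-bit layout, clamping values that fall out of range. */
    static byte floatToByte(float f) {
        if (f <= 0.0f) {
            return 0;
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) {          // overflow: largest representable value
            exponent = 31;
            mantissa = 7;
        }
        if (exponent < 0) {           // underflow: smallest positive value
            exponent = 0;
            mantissa = 1;
        }
        return (byte) ((exponent << 3) | mantissa);
    }

    public static void main(String[] args) {
        // Each rank maps to a distinct float that survives the byte round trip,
        // and a larger rank always yields a larger float.
        for (int rank = 1; rank <= 255; rank++) {
            float boost = byteToFloat((byte) rank);
            boolean roundTrips = floatToByte(boost) == (byte) rank;
            boolean increasing = boost > byteToFloat((byte) (rank - 1));
            if (!roundTrips || !increasing) {
                throw new IllegalStateException("rank " + rank + " is not preserved");
            }
        }
        System.out.println("all ranks preserved");
    }
}
```

Under these assumptions, once length normalization is removed and score() returns the decoded norm directly, the hit order follows the rank order exactly.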
> >>>
> >>>
> >>>
> >>>Dmitry Serebrennikov wrote:
> >>>
> >>>
> >>>>Greetings Everyone,
> >>>>
> >>>>I'm thinking of trying to build something that manipulates a query
> >>>>score in order to achieve a sort order other than the default
> >>>>relevance sort. The idea is to create a new type of query:
> >>>>SortingQuery( Query query, String sortByField )
> >>>>
> >>>>It would run the sub-query and return results in the order of the
> >>>>values found in the "sortByField" for those documents. Now, I've
> >>>>looked at all of the sorting discussion prior to this, and the best
> >>>>approach (recommended by Doug among others) is to provide some sort of
> >>>>fast access to the field values inside the HitCollector. Reading
> >>>>documents at search time is too slow, so people access the data
> >>>>elsewhere or build an in-memory index of that data (such as is done in
> >>>>the SearchBean's SortField).
> >>>>
> >>>>My idea is different. I want to try to do the following:
> >>>>- compose a query that consists of the original sub-query followed by
> >>>>a special "sorting query"
> >>>>- "boost" the score of the original sub-query to 0
> >>>>- compute the score of the sorting query such that it would reflect
> >>>>the desired sort order
> >>>>
> >>>>Has anyone tried to do something like this?
> >>>>Would this work?
> >>>>Is this worth doing?
> >>>>If it would, would I then have to do something at indexing
> >>>>time to set normalization / scoring factors for that field to
> >>>>something or other?
> >>>>
> >>>>Thanks.
> >>>>Dmitry.
> >>>>
> >>>>
> >>>>
> >>>>-- 
> >>>>To unsubscribe, e-mail:   
> >>>><mailto:lucene-user-unsubscribe@jakarta.apache.org>
> >>>>For additional commands, e-mail: 
> >>>><mailto:lucene-user-help@jakarta.apache.org>