lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Consider only documents of a category for IDF
Date Mon, 18 Oct 2010 12:32:18 GMT
Can you not just call reader.docFreq(categoryTerm)?

The returned figure includes deleted docs, but the search term's frequency
comes from this method too, so both should suffer from the same inaccuracy.
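To illustrate the idea, here is a minimal sketch that uses a toy in-memory index rather than Lucene itself. The class CategoryIdfSketch and its methods are hypothetical names made up for this example; only invCatFreq mirrors the formula from the original post. The point is that the category size is simply the document frequency of the category term, so only the per-term count needs an intersection. The toy index has no notion of deleted docs.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory index; hypothetical names, not the Lucene API. */
class CategoryIdfSketch {
    // Maps a term such as "CATEGORY:sports" to the ids of the docs containing it.
    private final Map<String, BitSet> postings = new HashMap<>();

    /** Registers a document under each of the given terms. */
    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new BitSet()).set(docId);
        }
    }

    /** Stand-in for a docFreq lookup: number of docs containing the term. */
    int docFreq(String term) {
        BitSet docs = postings.get(term);
        return docs == null ? 0 : docs.cardinality();
    }

    /** Number of docs that contain the term and also carry the category term. */
    int docFreqInCategory(String term, String categoryTerm) {
        BitSet docs = postings.get(term);
        BitSet category = postings.get(categoryTerm);
        if (docs == null || category == null) {
            return 0;
        }
        BitSet both = (BitSet) docs.clone(); // copy so the postings stay untouched
        both.and(category);
        return both.cardinality();
    }

    /** Same formula as invCatFreq in the original post. */
    static float invCatFreq(long termCategoryFreq, long catDocs) {
        return termCategoryFreq == 0 ? 0f
                : (float) (Math.log((double) catDocs / termCategoryFreq) + 1.0);
    }

    public static void main(String[] args) {
        CategoryIdfSketch index = new CategoryIdfSketch();
        index.add(0, "CATEGORY:sports", "TEXT:goal");
        index.add(1, "CATEGORY:sports", "TEXT:match");
        index.add(2, "CATEGORY:sports", "TEXT:match");
        index.add(3, "CATEGORY:politics", "TEXT:goal");

        long catDocs = index.docFreq("CATEGORY:sports"); // no bit set needed here
        long termFreq = index.docFreqInCategory("TEXT:goal", "CATEGORY:sports");
        System.out.println(invCatFreq(termFreq, catDocs));
    }
}
```

In this sketch the denominator comes from a plain frequency lookup, while the numerator still intersects the two posting sets on a copy, so the stored postings are never modified.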

Cheers
Mark



----- Original Message ----
From: Max Jakob <max.jakob@fu-berlin.de>
To: java-user@lucene.apache.org
Sent: Mon, 18 October, 2010 12:26:33
Subject: Consider only documents of a category for IDF

Hi,

I would like to change the IDF value of the Lucene similarity
computation to "inverse document frequency inside category". Only the
documents that have a certain category should be considered, rather than
the complete collection. The categories are stored as separate fields.
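
Written out, the quantity computed by invCatFreq in the implementation below is the following. The symbols are introduced here for illustration only: N_c stands for catDocs (the number of documents in the category) and df_c(t) for termCategoryFreq (the number of those documents that contain term t). The logarithm is the natural one, as in Math.log.

```latex
\mathrm{idf}_c(t) =
\begin{cases}
  0 & \text{if } \mathrm{df}_c(t) = 0, \\[4pt]
  \ln\dfrac{N_c}{\mathrm{df}_c(t)} + 1 & \text{otherwise.}
\end{cases}
```

For example, a term that occurs in 1 of the 3 documents of a category gets ln(3) + 1, and the category term itself gets ln(1) + 1 = 1.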

The implementation below works, but it is kind of slow. I was
wondering if there is a more efficient way than to read the DocIdSet
from the index for each term.

Thanks in advance for any pointers you might have!
Regards,
Max

public class InCategorySimilarity extends DefaultSimilarity {

   public InCategorySimilarity() {}

   // These objects have to be here so that they are visible across
   // multiple executions of idfExplain
   OpenBitSet categoryIdSet;
   long catDocs = 1;

   @Override
   public Explanation.IDFExplanation idfExplain(final Term term,
           final Searcher searcher) throws IOException {
       return new Explanation.IDFExplanation() {
           long termCategoryFreq = 0;
           boolean isCategoryField = term.field().equals("CATEGORY");

           private long termCategoryFreq() {
               try {
                   IndexReader reader =
                           ((IndexSearcher) searcher).getIndexReader();
                   TermsFilter filter = new TermsFilter();
                   filter.addTerm(term);
                   OpenBitSet docSet = (OpenBitSet) filter.getDocIdSet(reader);

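                   if (isCategoryField) {
                       // Remember the category's documents for the other terms.
                       categoryIdSet = docSet;
                       catDocs = categoryIdSet.cardinality();
                   } else {
                       // Assumes the CATEGORY term has already been processed,
                       // so that categoryIdSet is set at this point.
                       docSet.and(categoryIdSet);
                   }
                   termCategoryFreq = docSet.cardinality();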
               } catch (IOException e) {
                   //handle
               }
               return termCategoryFreq;
           }

           public float invCatFreq(long termCategoryFreq, long catDocs) {
               return termCategoryFreq == 0 ? 0
                       : (float) (Math.log((double) catDocs / termCategoryFreq) + 1.0);
           }

           @Override
           public float getIdf() {
               termCategoryFreq = termCategoryFreq();
               float invCatFreq = invCatFreq(termCategoryFreq, catDocs);
               return invCatFreq;
           }
       };
   }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      
