lucene-java-user mailing list archives

From Dharmalingam <dgane...@fc-md.umd.edu>
Subject Re: Vector Space Model: New Similarity Implementation Issues
Date Thu, 28 Feb 2008 14:00:10 GMT

Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
correct that the model is based on Salton's VSM. However, the calculation of
the term weight and the doc norm is, in my opinion, different from Lucene's.
If you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
the Similarity and DefaultSimilarity classes. The lengthNorm method is:

public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc depends only on the number of
terms, which is quite different from the norm calculation on that website.
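
To make the difference concrete, here is a minimal standalone sketch (plain
Java, no Lucene classes) that puts the quoted lengthNorm body next to a norm
built from wi=tfi*idfi. The class name DocNormSketch, the helper names, the toy
counts, and the choice idf = log10(D/df) are assumptions for illustration; they
are not taken from Lucene and have not been checked against the website's
table.

```java
import java.util.HashMap;
import java.util.Map;

final class DocNormSketch {

    // Same arithmetic as the lengthNorm body quoted above.
    static float luceneLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed idf form: log10(numDocs / docFreq).
    static double idf(int numDocs, int docFreq) {
        return Math.log10((double) numDocs / docFreq);
    }

    // Euclidean norm of the weights w_i = tf_i * idf_i for one document.
    // Terms without a known document frequency are skipped.
    static double tfIdfNorm(Map<String, Integer> termFreqs,
                            Map<String, Integer> docFreqs,
                            int numDocs) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int docFreq = docFreqs.getOrDefault(e.getKey(), 0);
            if (docFreq > 0) {
                double w = e.getValue() * idf(numDocs, docFreq);
                sumOfSquares += w * w;
            }
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical toy document: "lucene" occurs 3 times, "index" once.
        Map<String, Integer> termFreqs = new HashMap<>();
        termFreqs.put("lucene", 3);
        termFreqs.put("index", 1);

        // Hypothetical corpus of 10 docs: "lucene" in 1 doc, "index" in all 10.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 1);
        docFreqs.put("index", 10);

        System.out.println("lengthNorm(4 terms) = " + luceneLengthNorm(4));
        System.out.println("tf*idf norm         = " + tfIdfNorm(termFreqs, docFreqs, 10));
    }
}
```

Because idf depends on corpus-wide document frequencies, the second norm cannot
be computed from numTerms alone, which is all that lengthNorm receives.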

Similarly, the queryNorm method of the DefaultSimilarity class is:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different from the website model.

I also have difficulties with the tf method of DefaultSimilarity:

/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
    return (float)Math.sqrt(freq);
}

In that website model, tf is simply the frequency of a term within a doc,
without the square root.
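
The gap between the two tf definitions can be seen in a small standalone
sketch. The class TfSketch and its method names are hypothetical; dampedTf
copies the arithmetic of the tf body quoted above, and rawTf is what a
raw-frequency model would use instead.

```java
final class TfSketch {

    // Same arithmetic as the DefaultSimilarity.tf body quoted above.
    static float dampedTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Raw term frequency, as assumed for the website model.
    static float rawTf(float freq) {
        return freq;
    }

    public static void main(String[] args) {
        // Print both variants for a few within-document frequencies.
        for (float freq : new float[] {1f, 4f, 9f, 16f}) {
            System.out.println("freq=" + freq
                    + " sqrt tf=" + dampedTf(freq)
                    + " raw tf=" + rawTf(freq));
        }
    }
}
```

In a Lucene subclass, the corresponding change would presumably be to override
tf(float freq) to return freq, although that alone leaves the norm
calculation untouched.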

I hope I have explained it better. Please let me know if it is unclear. I am
looking for an easy way to implement that table, and of course I want to
integrate it with my Lucene setup (i.e., myIndexWriter.setSimilarity(new
mySimilarity());). Will this be possible just by inheriting from the base
classes of Lucene?
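
For reference, the kind of table in question (tf*idf weights for the query and
each document, then a cosine score) can be sketched outside Lucene in a few
lines of plain Java. This is only an illustration under stated assumptions:
the class VsmCosineSketch, its helpers, the whitespace tokenizer, the toy
documents, and idf = log10(D/df) are all hypothetical choices, and nothing here
uses or reproduces Lucene's own scoring code.

```java
import java.util.HashMap;
import java.util.Map;

final class VsmCosineSketch {

    // Raw term frequencies from whitespace-separated, lower-cased text.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Document frequency: the number of documents each term occurs in.
    static Map<String, Integer> docFreqs(String[] docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : termFreqs(doc).keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // w_i = tf_i * idf_i, with the assumed idf_i = log10(D / df_i).
    static Map<String, Double> weights(Map<String, Integer> tf,
                                       Map<String, Integer> df,
                                       int numDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            if (docFreq > 0) { // terms unseen in the corpus contribute nothing
                w.put(e.getKey(), e.getValue() * Math.log10((double) numDocs / docFreq));
            }
        }
        return w;
    }

    // Euclidean length of a weight vector.
    static double norm(Map<String, Double> w) {
        double sumOfSquares = 0.0;
        for (double x : w.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between query and document weight vectors.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    // One cosine score per document, in the order the documents were given.
    static double[] score(String query, String[] docs) {
        Map<String, Integer> df = docFreqs(docs);
        Map<String, Double> q = weights(termFreqs(query), df, docs.length);
        double[] scores = new double[docs.length];
        for (int i = 0; i < docs.length; i++) {
            scores[i] = cosine(q, weights(termFreqs(docs[i]), df, docs.length));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical toy corpus and query.
        String[] docs = { "gold fire", "silver truck silver", "gold truck" };
        double[] scores = score("gold silver truck", docs);
        for (int i = 0; i < docs.length; i++) {
            System.out.printf("doc %d (%s): %.4f%n", i, docs[i], scores[i]);
        }
    }
}
```

The document norm here needs the idf of every term in the document, not just
the query terms, which is the part that does not map directly onto a
lengthNorm(fieldName, numTerms) style hook.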

Thanks for your advice.

Grant Ingersoll-6 wrote:
> 
> Not sure I am understanding what you are asking, but I will give it a  
> shot.   See below
> 
> 
> On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:
> 
>>
>> Hi List,
>>
>> I am pretty new to Lucene. Certainly, it is very exciting. I need to
>> implement a new Similarity class based on the Term Vector Space Model given
>> in http://www.miislita.com/term-vector/term-vector-3.html
>>
>> Although that model is similar to Lucene’s model
>> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
>> I am having a hard time extending the Similarity class to calculate that
>> model.
>>
>> In that model, “tf” is multiplied with Idf for all terms in the index, but
>> in Lucene “tf” is calculated only for terms in the given Query. Because of
>> that effect, the norm calculation should also include “idf” for all terms.
>> Lucene calculates the norm, during indexing, by “just” counting the number
>> of terms per document. In the web formula (in miislita.com), a document norm
>> is calculated after multiplying “tf” and “idf”.
> 
> Are you wondering if there is a way to score all documents regardless  
> of whether the document has the term or not?  I don't quite get your  
> statement: "In that model, “tf” is multiplied with Idf for all terms  
> in the index, but in Lucene “tf” is calculated only for terms in the  
> given Query."
> 
> Isn't the result for those documents that don't have query terms just  
> going to be 0 or am I not fully understanding?  I briefly skimmed the  
> paper you cite and it doesn't seem that different, it's just  
> describing the Salton's VSM right?
> 
>>
>>
>> FYI: I could implement “idf” according to the miislita.com formula, but not
>> the “tf” and “norm”.
>>
>> Could you please comment on how I can implement a new Similarity class that
>> will fit into Lucene’s architecture, but still implement the vector space
>> model given in miislita.com?
> 
> In the end, you may need to implement some lower level Query classes,  
> but I still don't fully understand what you are trying to do, so I  
> wouldn't head down that path just yet.
> 
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15736946.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

