lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Vector Space Model: New Similarity Implementation Issues
Date Thu, 28 Feb 2008 22:08:56 GMT
FYI: The mailing list handler strips attachments.

At any rate, sounds like an interesting project.  I don't know how  
easy it will be for you to implement 7 variants of VSM in Lucene given  
the nature of the APIs, but if you do, it might be handy to see your  
changes as a patch.  :-)  Also not quite sure what all those variants  
will help with when it comes to your broader goal, but that isn't for  
me to decide :-)  Seems like your goal is to find the traceability  
stuff, not see if you can figure out how to change Lucene's  
similarity!  To that end, my two cents would be to focus on creating  
the right kinds of queries, analyzers, etc.


-Grant

On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:

>
> Thanks for your tips. My overall goal is to quickly implement 7
> variants of the vector space model using Lucene. You can find these
> variants in the uploaded file.
>
> I am doing all this for a much broader goal: I am trying to recover
> traceability links from requirements to source code files. I treat
> every requirement as a query. For this problem, I would like to
> compare this collection of algorithms on relevance.
>
>
>
>
> Grant Ingersoll-6 wrote:
>>
>>
>> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
>>
>>>
>>> Thanks for the reply. Sorry if my explanation was not clear. Yes, you
>>> are correct, the model is based on Salton's VSM. However, the
>>> calculation of the term weight and the doc norm is, in my opinion,
>>> different from Lucene's. If you look at the table given in
>>> http://www.miislita.com/term-vector/term-vector-3.html, they
>>> calculate the document norm based on the weight wi=tfi*idfi. I looked
>>> at the interfaces of the Similarity and DefaultSimilarity classes.
>>> Here is lengthNorm:
>>>
>>> public float lengthNorm(String fieldName, int numTerms) {
>>>   return (float)(1.0 / Math.sqrt(numTerms));
>>> }
>>>
>>> You can see that this lengthNorm for a doc is quite different from
>>> the norm calculation on that website.
>>
>> The lengthNorm method is different from the IDF calculation.  In the
>> Similarity class, that is handled by the idf() method.  Length norm is
>> an attempt to address one of the limitations listed further down in
>> that paper:
>> "Long Documents: Very long documents make similarity measures
>> difficult (vectors with small dot products and high dimensionality)"
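To make the contrast between the two normalizations concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class NormComparison, its helper names, and the sample numbers are made up for illustration, and the idf formula log10(numDocs / docFreq) is an assumption about what the tutorial uses. Only lengthNorm copies the formula quoted in this thread.

```java
/** Illustrative helper, not part of Lucene. */
class NormComparison {

    /** Tutorial-style term weight w_i = tf_i * idf_i, assuming idf_i = log10(numDocs / docFreq). */
    static double weight(int tf, int docFreq, int numDocs) {
        return tf * Math.log10((double) numDocs / docFreq);
    }

    /** Tutorial-style document norm: Euclidean length of the weight vector. */
    static double vectorLength(double[] weights) {
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Same formula as the lengthNorm quoted in this thread; only the term count matters. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Turns {tf, docFreq} pairs into a vector of tf*idf weights. */
    static double[] weights(int[][] terms, int numDocs) {
        double[] result = new double[terms.length];
        for (int i = 0; i < terms.length; i++) {
            result[i] = weight(terms[i][0], terms[i][1], numDocs);
        }
        return result;
    }

    public static void main(String[] args) {
        int numDocs = 3; // made-up collection size
        // Two made-up documents of four terms each, given as {tf, docFreq} pairs.
        int[][] mostlyCommonTerms = {{1, 3}, {1, 3}, {1, 3}, {1, 2}};
        int[][] mostlyRareTerms = {{1, 1}, {1, 1}, {1, 2}, {1, 3}};

        System.out.println("vector length (common terms): "
                + vectorLength(weights(mostlyCommonTerms, numDocs)));
        System.out.println("vector length (rare terms):   "
                + vectorLength(weights(mostlyRareTerms, numDocs)));
        // Both documents have four terms, so lengthNorm cannot tell them apart.
        System.out.println("lengthNorm for either doc:    " + lengthNorm(4));
    }
}
```

The two documents get different vector lengths because their terms have different idf values, while lengthNorm is the same for both since it only sees the term count.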
>>
>>
>>
>>>
>>>
>>> Similarly, the queryNorm method of the DefaultSimilarity class is:
>>>
>>> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>>> public float queryNorm(float sumOfSquaredWeights) {
>>>   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>> }
>>>
>>> This is again different from the website model.
>>
>> Query norm is an attempt to allow for comparison of scores across
>> queries, but I don't think one should do that anyway.
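A small sketch of why the query norm does not matter for ranking within a single query: it is one constant per query, so multiplying every document's score by it leaves the order unchanged. This is plain Java, not Lucene code. The class QueryNormSketch, the helper names, and all numbers are hypothetical; only queryNorm copies the formula quoted in this thread.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative helper, not part of Lucene. */
class QueryNormSketch {

    /** Same formula as the queryNorm quoted in this thread. */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Sum of squared query term weights (the weights here are made-up inputs). */
    static float sumOfSquaredWeights(float[] termWeights) {
        float sum = 0f;
        for (float w : termWeights) {
            sum += w * w;
        }
        return sum;
    }

    /** Multiplies every score by the same factor. */
    static float[] scale(float[] scores, float factor) {
        float[] scaled = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scaled[i] = scores[i] * factor;
        }
        return scaled;
    }

    /** Document indices ordered from highest to lowest score. */
    static Integer[] ranking(final float[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        float[] queryTermWeights = {0.5f, 1.5f}; // hypothetical weights of two query terms
        float[] rawScores = {2.0f, 0.7f, 1.3f};  // hypothetical un-normalized scores of three docs

        float norm = queryNorm(sumOfSquaredWeights(queryTermWeights));
        float[] normalized = scale(rawScores, norm);

        System.out.println("ranking before: " + Arrays.toString(ranking(rawScores)));
        System.out.println("ranking after:  " + Arrays.toString(ranking(normalized)));
    }
}
```

Both lines print the same ordering, which is why a different query norm from the website model would change the absolute scores but not which document comes first.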
>>
>>
>>>
>>>
>>> I also have difficulties with the tf method of DefaultSimilarity:
>>> /** Implemented as <code>sqrt(freq)</code>. */
>>> public float tf(float freq) {
>>>   return (float)Math.sqrt(freq);
>>> }
>>>
>>
>> These are all callback methods from within the Scorer classes that
>> each Query uses.  Have a look at TermScorer for how these things get
>> called.
>>
>>
>> Try this as an example:
>>
>> Set up a really simple index with 1 or 2 docs, each with a few words.
>> Set up a simple Similarity class where you override all of these
>> methods to return 1 (or some simple default),
>> and then index your documents and do a few queries.
>>
>> Then, have a look at Searcher.explain() to see why a document scores
>> the way it does.  Then, you can work to modify from there.
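The "return 1 everywhere" experiment can also be mimicked without an index, as a rough way to see what each factor contributes. The sketch below is a simplified stand-in, not Lucene code: the Factors interface, the class FlatScoringSketch, and the toy corpus are hypothetical, boosts are ignored, and query terms are assumed to be distinct. The score is modelled as coord * queryNorm * sum over matching query terms of tf * idf^2 * lengthNorm.

```java
/** Illustrative scoring model, not part of Lucene. */
class FlatScoringSketch {

    /** Hypothetical stand-in for the Similarity callbacks discussed in this thread. */
    interface Factors {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTerms);
        float queryNorm(float sumOfSquaredWeights);
        float coord(int overlap, int maxOverlap);
    }

    /** Every callback returns 1, as in the suggested experiment. */
    static final Factors ALL_ONES = new Factors() {
        public float tf(float freq) { return 1f; }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTerms) { return 1f; }
        public float queryNorm(float sumOfSquaredWeights) { return 1f; }
        public float coord(int overlap, int maxOverlap) { return 1f; }
    };

    /** Number of times a term occurs in a tokenized document. */
    static int count(String term, String[] doc) {
        int n = 0;
        for (String t : doc) {
            if (t.equals(term)) {
                n++;
            }
        }
        return n;
    }

    /** coord * queryNorm * sum over matching query terms of tf * idf^2 * lengthNorm. */
    static float score(String[] query, int docId, String[][] corpus, Factors f) {
        String[] doc = corpus[docId];
        float sumOfSquaredWeights = 0f;
        float sum = 0f;
        int overlap = 0;
        for (String term : query) {
            int docFreq = 0;
            for (String[] d : corpus) {
                if (count(term, d) > 0) {
                    docFreq++;
                }
            }
            float idf = f.idf(docFreq, corpus.length);
            sumOfSquaredWeights += idf * idf;
            int freq = count(term, doc);
            if (freq > 0) { // only query terms that occur in the document contribute
                overlap++;
                sum += f.tf(freq) * idf * idf * f.lengthNorm(doc.length);
            }
        }
        return f.coord(overlap, query.length) * f.queryNorm(sumOfSquaredWeights) * sum;
    }

    public static void main(String[] args) {
        String[][] corpus = {
            {"shipment", "of", "gold"},
            {"delivery", "of", "silver"}
        };
        String[] query = {"gold", "silver", "of"};
        for (int i = 0; i < corpus.length; i++) {
            System.out.println("doc " + i + " score = " + score(query, i, corpus, ALL_ONES));
        }
    }
}
```

With every factor fixed at 1, the score is just the number of query terms the document contains. Swapping in one real formula at a time (for example sqrt(freq) for tf) then shows how that factor alone moves the score, which is the same idea as reading the explain() output after each change.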
>>
>> Here's the bigger question:  what is your ultimate goal here?  Are you
>> just trying to understand Lucene at an academic/programming level or
>> do you have something you are trying to achieve in terms of relevance?
>>
>> -Grant
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
> http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
>
>
>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





