lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Document Similarities lucene(particularly using doc id's)
Date Mon, 20 Aug 2007 10:47:27 GMT
You would have write it, as it doesn't exist in Lucene (but could be  
a useful contribution).  The easiest version is probably the cosine  
similarity, described at http://en.wikipedia.org/wiki/Vector_space_model

Essentially, you have two vectors, and you need to calculate the  
cosine of the angle between them.  Then you can do this for each one  
in your set.

What is the bigger goal you are trying to get at?  What do you need  
the similarity score for?  Do you need to compare every item in set 1  
against every item in set 2?

On Aug 19, 2007, at 11:19 PM, Lokeya wrote:

>
> Hi,
>
> Thanks for your reply.
>
> I can use the getTermFreqVector() on Index Reader and get it. But I am
> wondering whats the API which has to be used to find the similarity  
> between
> 2 such vectors which would give a score (doc-doc similairty in   
> essence).
>
> Thanks.
>
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hi,
>>
>>
>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>
>>>
>>> Hi All,
>>>
>>> I have the following set up: a) Indexed set of docs. b) Ran 1st
>>> query and
>>> got tops docs  c) Fetched the id's from that and stored in a data
>>> structure.
>>> d) Ran 2nd query , got top docs , fetched id's and stored in a data
>>> structure.
>>>
>>> Now i have 2 sets of doc ids (set 1) and (set 1).
>>>
>>> I want to find out the document content similarity between these 2
>>> sets(just
>>> using doc ids information which i have).
>>>
>>
>> Not sure what you mean here.  What do the doc ids have to do with the
>> content?
>>
>>> Qn 1: Is it possible using any lucene api's. In that case can you
>>> point me
>>> to the appropriate API's. I did a search at
>>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/
>>> javadoc/index.html
>>> But couldn't find anything.
>>>
>>
>> It is possible if you use Term Vectors (see
>> IndexReader.getTermFreqVector).  You will need to store (when you
>> construct your Field) and load the term vectors and then calculate
>> the similarity.  A common way of doing this is by calculating the
>> cosine of the angle between the two vectors.
>>
>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> -- 
> View this message in context: http://www.nabble.com/Document- 
> Similarities-lucene%28particularly-using-doc-id%27s%29- 
> tf4281286.html#a12229492
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message