lucene-java-user mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject Re: Document Similarity
Date Wed, 03 Dec 2003 22:24:09 GMT

Hi,

>> Do they produce same ranking results? 

No; Lucene's operations on query weight and length normalization are not
equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity will indeed be faster, and it may not actually be worse.
A straightforward cosine without IDF weighting of terms (which Lucene
does apply) will almost certainly be less precise if your documents
differ in length: word occurrence probabilities vary greatly between
texts of different lengths, and the cosine of two unrelated longer texts
will often be greater than that of two short texts that actually share
a topic, just because of randomly matching non-content words.
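
To make the length effect concrete, here is a minimal Python sketch,
assuming plain whitespace tokenization and raw term counts. The function
name raw_cosine and the four toy texts are made up for illustration;
none of this is Lucene code.

```python
import math
from collections import Counter


def raw_cosine(text_a, text_b):
    """Cosine of raw term-frequency vectors (no IDF, no stop-word removal)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Dot product over the terms both texts share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy texts: two short ones on the same topic,
# two longer ones on unrelated topics.
short_a = "lucene ranks documents"
short_b = "lucene scores documents"
long_a = "the report of the committee on the budget of the city was read to the council"
long_b = "the history of the castle and the story of the king was told to the children"

print(raw_cosine(short_a, short_b))  # same topic, about 0.67
print(raw_cosine(long_a, long_b))    # unrelated, about 0.82
```

The unrelated long pair scores higher than the topical short pair only
because "the", "of", "was" and "to" dominate both long vectors.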

If, on the other hand, you choose the right TF/IDF weighting of
terms, the cosine in this warped vector space could be (a)
equivalent to what Lucene computes, though getting there requires
some work, or (b) even better on average.
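
As a rough sketch of what such a weighted cosine could look like, the
Python below uses one common textbook variant, weight = tf * log(N / df).
This is an assumption chosen for illustration, not Lucene's scoring
formula, and the names tfidf_vectors, cosine and the toy corpus are
hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, weight = tf * log(N / df)."""
    term_counts = [Counter(t.lower().split()) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(vec_a, vec_b):
    """Cosine between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: two topical texts, two unrelated longer ones.
corpus = [
    "the lucene index of the search engine",
    "the search engine and the lucene index",
    "the report of the committee on the budget of the city was read to the council",
    "the history of the castle and the story of the king was told to the children",
]
vectors = tfidf_vectors(corpus)
topical = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[2], vectors[3])
print(topical, unrelated)
```

With this weighting, "the" occurs in every text and gets weight zero, so
the topical pair now ranks well above the unrelated pair. A different
TF/IDF formula would give different numbers.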

However, the last time I counted, there were about 250 different
TF/IDF formulas around in IR publications, machine learning,
computational linguistics and so on. Performance depends on domain
and language.

But if I were you, I would just start playing and have fun with
the stuff...

Karsten


-----Original Message-----
From: Jing Su [mailto:J.Su@cs.bham.ac.uk]
Sent: Tuesday, 2 December 2003 18:12
To: lucene-user@jakarta.apache.org
Subject: Document Similarity



Hi,

I have read some posts in the user/developer archives about Lucene-based document similarity comparison.
In summary, two approaches are mentioned:

1 - Turn the document into a query;
2 - Represent each document as a vector, then rank according to their distance (cosine).

Do they produce the same ranking results? Is there any other way to do this? I guess the 2nd approach
will be more precise but slow.

Thanks.

Jing

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

