mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Similarity Measures for Text Document Clustering
Date Sat, 24 May 2014 13:28:43 GMT
I just read this paper and it is very nicely written up.  There are a few
unfortunate omissions:

1) cosine is equivalent to Euclidean with the addition of document and
centroid normalization.

2) the entropy measure given appears to be an ad hoc partial derivation of
mutual information, but this is not mentioned, nor are the differences
examined.

3) the tf-idf measure in the paper uses raw tf.  It is usually better to use
log(tf) or sqrt(tf).  This is not examined.

4) the same number of clusters as target categories is used.  Commonly,
clustering is used as a feature for classification and there is no
rationale in that case for the number of clusters to be the same as the
number of target categories.

5) if (4) is accepted, then mutual information is immediately better than
the entropy measure shown since it normalizes away the number of
clusters.
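
Points (1) and (3) are easy to check numerically.  The following is a
minimal sketch, not code from the paper or from Mahout: the vectors are
made up, the helper names are hypothetical, and log(1 + tf) is assumed in
place of log(tf) so that zero counts stay at zero.  It shows that for
unit-length vectors the squared Euclidean distance is 2 - 2 * cosine, so
the two measures rank neighbors the same way once documents and centroids
are normalized.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dampen(tf_vector, how="log"):
    """Dampen raw term counts (assumption: log(1 + tf) stands in for log(tf))."""
    if how == "log":
        return [math.log1p(tf) for tf in tf_vector]
    if how == "sqrt":
        return [math.sqrt(tf) for tf in tf_vector]
    return list(tf_vector)

# Made-up term-count vectors, purely for illustration.
doc = [10, 0, 3, 1]
centroid = [2, 1, 4, 0]

cos = cosine_similarity(doc, centroid)
d2 = squared_euclidean(l2_normalize(doc), l2_normalize(centroid))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert abs(d2 - (2 - 2 * cos)) < 1e-9

# Dampening shrinks the influence of the very frequent first term.
print(dampen(doc, "log"))
print(dampen(doc, "sqrt"))
```

In a k-means setting the centroid has to be re-normalized after each
update for the equivalence to hold, which is the "centroid normalization"
mentioned in (1).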





On Fri, May 23, 2014 at 9:39 PM, David Noel <david.i.noel@gmail.com> wrote:

> I found an interesting paper that I thought someone here might find
> helpful.
>
>
> http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
>
> ABSTRACT: ... A wide variety of distance functions and similarity
> measures have been used for clustering, such as squared Euclidean
> distance, cosine similarity, and relative entropy. In this paper, we
> compare and analyze the effectiveness of these measures in partitional
> clustering for text document datasets. Our experiments utilize the
> standard K-means algorithm and we report results on seven text
> document datasets and five distance/similarity measures that have been
> most commonly used in text clustering.
>
> TL;DR: For text documents, favor Cosine, Jaccard/Tanimoto, or Pearson
> over Euclidean distance measures.
>
