mahout-user mailing list archives

From Yang
Subject mahout kmeans gives a random result for short documents
Date Tue, 21 Oct 2014 21:13:27 GMT
We are trying to run kmeans on some product titles
so that we can cluster similar products together,
like "nike flex sneaker size 9" vs "nike flex sneaker size 8".
It works fine for most titles,
but it turns out that a lot of the titles are very short (particularly
after filtering stopwords),
so I end up with many 1-word or 2-word titles.
Somehow these got lumped together into one huge cluster
whose members have no similarity to each other at all.
I traced some specific examples in this cluster, and
it seems that the algorithm is indeed doing what it's supposed to do.

Does anybody have similar experience clustering particularly short "documents"?
More generally, are there any tricks to force these members to "jump" out and
join another cluster? (I do see other, smaller clusters with matching words.)
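
To illustrate the lumping described above, here is a minimal sketch of one way it could happen. This is only a guess at the mechanism, not something confirmed for this Mahout run. It assumes plain Euclidean distance on unnormalized term vectors, and the centroid names, terms, and weights are all made up for illustration. A 1-word title that shares no term with any centroid is "equally unrelated" to all of them, so it simply goes to the centroid with the smallest norm.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse term vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def nearest(doc, centroids):
    """Name of the centroid closest to doc (the k-means assignment step)."""
    return min(centroids, key=lambda name: euclidean(doc, centroids[name]))

# Hypothetical centroids: one tight cluster with large weights on a few terms,
# and one diffuse cluster whose centroid has small weights (small norm).
centroids = {
    "nike_sneakers": {"nike": 1.0, "flex": 1.0, "sneaker": 1.0},
    "catch_all": {"lamp": 0.1, "cable": 0.1, "mug": 0.1},
}

# Hypothetical 1-word titles that share no term with either centroid.
short_titles = [{"charger": 1.0}, {"blanket": 1.0}, {"umbrella": 1.0}]

# Every unrelated short title lands in the small-norm centroid's cluster.
assignments = [nearest(t, centroids) for t in short_titles]
print(assignments)
```

If this is what is going on, the assignment is driven by centroid norm rather than by any shared words, which would match the observation that the algorithm is "doing what it's supposed to do" while the cluster members look unrelated.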

