mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Word and Phrase Clustering
Date Fri, 02 Dec 2011 04:29:19 GMT
Could you elaborate a bit on what you mean by "cluster a collection of
words and phrases by syntactic similarity over a distributed
environment"? If you can describe your collection as a set of (sparse or
dense) term vectors, then you should be able to use Mahout clustering
directly. The vectors do not need to be huge (as "document" might
imply); indeed, lower-dimensional clusterings work better than
higher-dimensional ones. The first question is how you plan to encode
these vectors. The second is how large your collection is.
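
To make the encoding question concrete, here is a minimal sketch of one
possible encoding. It assumes character trigrams as the features, which is
only one guess at what "syntactic similarity" could mean here; the function
names (char_ngrams, encode, cosine) are made up for illustration and are not
part of Mahout.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = " " + text.lower().strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def encode(text, n=3):
    """Encode a word or phrase as a sparse vector mapping n-gram -> count.

    Assumption: character n-grams are a reasonable feature set for the
    collection. Any other feature extractor could be swapped in here.
    """
    return Counter(char_ngrams(text, n))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    items = ["clustering", "clusters", "mahout"]
    vectors = {item: encode(item) for item in items}
    for other in items[1:]:
        print(items[0], other, cosine(vectors[items[0]], vectors[other]))
```

Each word or phrase becomes a small sparse vector, and items that share
many trigrams end up close together under cosine distance. Vectors of this
shape would still need to be written out in whatever input format the
chosen clustering job expects.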

On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> I have a need to cluster a collection of words and phrases by syntactic similarity over
> a distributed environment, and I came upon Mahout as a possible solution. After studying the
> documentation though, I am finding all of it tailored to working with entire documents rather
> than words and phrases. I simply want to know if you believe that Mahout is the right tool
> for this job. I suppose I could try to view each word and phrase as individual tiny documents,
> but that feels like I am forcing it.
>
> Any insight is appreciated.
>
> Thanks.
>

