mahout-user mailing list archives

From nfantone <>
Subject Re: Clustering from DB
Date Thu, 02 Jul 2009 18:16:54 GMT
Thanks for the feedback, Jeff.

> The logical format of input to KMeans is <Key, Vector> as it is in sequence
> file format, but the Key is never used. To my knowledge, there is no
> requirement to assign identifiers to the input points*. Users are free to
> associate an arbitrary name field with each vector - also label mappings may
> be assigned - but these are not manipulated by KMeans or any of the other
> clustering applications. The name field is now used as a vector identifier
> by the KMeansClusterMapper - if it is non-null - in the output step only.

The keys may not be used internally, but externally they can prove to
be pretty useful. In my case, the keys are user IDs and each Vector
represents a user's historical behavior. Being able to collect the
output as <UserID, ClusterID> pairs is quite neat: it lets me, for
instance, retrieve user information using a field read directly from
an HDFS file.
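
To make the idea concrete, here is a minimal sketch in plain Java. It is
not Mahout code and uses no Mahout or Hadoop classes: the class
UserClusterAssignment and its methods assign and nearestCluster are
hypothetical names, the centroids are assumed to be already computed, and
the cluster ID is simply taken to be the centroid's index. It only shows
the key (the user ID) being carried through to the output next to the
cluster ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: hypothetical names, no Mahout or Hadoop API involved.
public class UserClusterAssignment {

    // Squared Euclidean distance between two equal-length vectors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the point (assumed to act as the cluster ID).
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Keeps each key (user ID) and pairs it with the cluster ID of its vector.
    static Map<String, Integer> assign(Map<String, double[]> usersById, double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : usersById.entrySet()) {
            result.put(e.getKey(), nearestCluster(e.getValue(), centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: user ID -> behavior vector.
        Map<String, double[]> users = new LinkedHashMap<>();
        users.put("user-1", new double[] {1.0, 0.5});
        users.put("user-2", new double[] {9.0, 8.5});
        users.put("user-3", new double[] {0.5, 1.5});
        double[][] centroids = { {1.0, 1.0}, {9.0, 9.0} };

        // Prints one "<UserID>\t<ClusterID>" line per user.
        assign(users, centroids).forEach((id, cluster) -> System.out.println(id + "\t" + cluster));
    }
}
```

With output in this shape, the user ID in each line can be used directly
as a lookup key against whatever store holds the user records.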
