mahout-user mailing list archives

From nfantone <>
Subject Re: Clustering from DB
Date Mon, 06 Jul 2009 18:31:18 GMT
Fellows, today I updated to revision 791558 and, while running kMeans, I
got the following exception:

output/clusters-0/part-00000/* does not exist. File output/clusters-0/part-00000/*
does not exist.

The algorithm isn't interrupted, though. This exception wasn't
thrown before the update and, to me, its message is not quite clear.
It seems to be looking for any file inside a "part-00000" directory,
which doesn't exist; as far as I know, "part-xxxxx" are the default
names for output files, not directories.
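
To illustrate that reading of the message, here is a minimal local sketch. It assumes that "part-00000" is a regular file under output/clusters-0 and uses Python's standard glob module on a temporary directory as a stand-in; it does not call Hadoop or Mahout, and the helper names are made up for the example.

```python
import glob
import os
import tempfile


def glob_under(root, pattern):
    """Return paths under root that match pattern, relative to root."""
    hits = glob.glob(os.path.join(root, pattern))
    return sorted(os.path.relpath(h, root) for h in hits)


def make_kmeans_like_output(root):
    """Create output/clusters-0/part-00000 as a plain file (assumed layout)."""
    cluster_dir = os.path.join(root, "output", "clusters-0")
    os.makedirs(cluster_dir)
    with open(os.path.join(cluster_dir, "part-00000"), "w") as f:
        f.write("placeholder cluster data\n")


with tempfile.TemporaryDirectory() as root:
    make_kmeans_like_output(root)
    # Pattern from the exception: it treats part-00000 as a directory.
    as_directory = glob_under(root, os.path.join("output", "clusters-0", "part-00000", "*"))
    # Pattern that treats the part-xxxxx entries as files.
    as_files = glob_under(root, os.path.join("output", "clusters-0", "part-*"))

print("part-00000/* matches:", as_directory)
print("part-* matches:", as_files)
```

If the assumption holds, the first pattern matches nothing because a regular file has no children, while the second one finds the output file.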

I could show the entire stack trace, if needed. Any pointers?

On Thu, Jul 2, 2009 at 3:16 PM, nfantone<> wrote:
> Thanks for the feedback, Jeff.
>> The logical format of input to KMeans is <Key, Vector> as it is in sequence
>> file format, but the Key is never used. To my knowledge, there is no
>> requirement to assign identifiers to the input points*. Users are free to
>> associate an arbitrary name field with each vector - also label mappings may
>> be assigned - but these are not manipulated by KMeans or any of the other
>> clustering applications. The name field is now used as a vector identifier
>> by the KMeansClusterMapper - if it is non-null - in the output step only.
> The keys may not be used internally, but externally they can prove to
> be pretty useful. For me, keys are userIDs and each Vector represents
> his/her historical behavior. Being able to collect the output
> information as <UserID, ClusterID> is quite neat as it allows me to,
> for instance, retrieve user information using data directly from an
> HDFS file's field.
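
The <UserID, ClusterID> output described above can be sketched in a few lines. This is only an illustration in plain Python, not Mahout's KMeansClusterMapper: the user IDs, vectors, cluster IDs and centroids are hypothetical, and each user is simply assigned to the nearest centroid by Euclidean distance.

```python
import math


def nearest_cluster(vector, centroids):
    """Return the id of the centroid closest to vector (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(vector, centroids[cid]))


def assign_users(user_vectors, centroids):
    """Map each user id (the key) to the id of its nearest cluster."""
    return {uid: nearest_cluster(vec, centroids) for uid, vec in user_vectors.items()}


# Hypothetical input: key = user id, value = behaviour vector.
user_vectors = {
    "user-1": [1.0, 1.0],
    "user-2": [9.0, 8.0],
    "user-3": [0.5, 2.0],
}
# Hypothetical centroids, as if produced by a finished k-means run.
centroids = {
    "cluster-A": [1.0, 1.5],
    "cluster-B": [8.5, 8.0],
}

assignments = assign_users(user_vectors, centroids)
for uid, cid in sorted(assignments.items()):
    print(uid, cid)
```

Keeping the key next to each vector is what makes the final mapping usable for looking up user information; without it, the output would only say which vectors fell into which cluster.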
