mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Clustering from DB
Date Thu, 02 Jul 2009 15:34:34 GMT
See inline comments:

nfantone wrote:
> After some research and testing, I believe I can throw some light on
> the subject. The runJob() static method defined in KMeansDriver
> expects three file paths, referencing three different files, each with
> a different logical record format; moreover, a "points" directory,
> along with other files, is created as part of the output:
>
> 1) input
>
> Description: A file containing data to be clustered, represented by Vectors.
> Path: An absolute path to an HDFS data file.  Example: "input/thedata.dat"
> Logical format: <ID, Vector>. The ID could be anything as long as it
> extends Writable.
>   
The logical format of input to KMeans is <Key, Vector>, since it is in 
sequence file format, but the Key is never used. To my knowledge, there 
is no requirement to assign identifiers to the input points*. Users are 
free to associate an arbitrary name field with each vector, and label 
mappings may also be assigned, but these are not manipulated by KMeans or 
any of the other clustering applications. The name field is now used as 
a vector identifier by the KMeansClusterMapper, if it is non-null, in 
the output step only.

*MeanShift could certainly benefit from a requirement that all input 
points have unique identifiers. Using the optional name field in this 
manner seems pretty kludgy to me.
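
To illustrate the point that the Key is carried along but never read, here 
is a minimal, self-contained sketch. It does not use the Mahout or Hadoop 
classes: KeyIgnoredSketch, nearestCenter and the sample data are all 
hypothetical, and squared Euclidean distance is an assumption made only for 
the example.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the Mahout KMeans implementation.
class KeyIgnoredSketch {

    // Returns the index of the center closest to the point,
    // using squared Euclidean distance (an assumption for this sketch).
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centers[c][i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Records are logically <Key, Vector>; keys and points are parallel arrays here.
        String[] keys = {"any-key", "another-key"};
        double[][] points = {{1.0, 1.0}, {9.0, 9.0}};
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};

        for (int r = 0; r < points.length; r++) {
            // Only the vector is consulted; keys[r] is never read.
            int cluster = nearestCenter(points[r], centers);
            System.out.println(Arrays.toString(points[r]) + " -> center " + cluster);
        }
    }
}
```

Changing the contents of keys has no effect on the assignment, which is 
the behaviour described above for the input Key.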
> Code example (writing an input file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Instantiate writer to input data in a .dat file
> // with a <Text, SparseVector> logical format
> String fileName = "input/thedata.dat";
> Path path = new Path(fileName);
>
> SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
> conf, path, Text.class, SparseVector.class);
> VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);
>
> // Write Vectors to file. inputVectors could be any VectorIterable
> // implementation.
> writer.write(inputVectors);
> writer.close();
>
> 2) clustersIn
>
> Description: A file containing the initial pre-computed (or randomly
> selected) clusters to be used by kMeans. The 'k' value is determined
> by the number of clusters in THIS file.
> Path: An absolute path to a DIRECTORY containing any number of files
> with a "part-xxxxx" name format, where 'x' is a one digit number. The
> name should be omitted from the path. Example: "input/initial", where
> 'initial' has a "part-00000" file stored in it.
> Logical format: <ID, ClusterBase>. The ID could be anything as long as
> it extends Writable.
>   
Again, the sequence file format requires a key, but it is not used. 
Each cluster has an internal ID in its state which is used by the 
implementation. Typically, the key is the same as the internal ID.
> Code example (writing a clustersIn file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Instantiate writer to input clusters in a file with a
> // <Text, Cluster> logical format
> String fileName = "input/initial/part-00000";
> Path path = new Path(fileName);
>
> SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
> conf, path, Text.class, Cluster.class);
>
> // We choose 'k' random Vectors as centers for the initial clusters.
> // 'inputVectors' could be any VectorIterable implementation.
> // CANT_INITIAL_CLUSTERS is the desired number of initial clusters (an integer).
> // The identifier of a Cluster is used as its ID.
> // AFAICT, you DO NOT need to add the center as an actual point in the cluster,
> // after cluster creation. This has been corrected recently.
> int k = 0;
> Iterator it = inputVectors.iterator();
> while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
> 	Vector v = (Vector)it.next();
> 	Cluster c = new Cluster(v);
> 	seqClusterWriter.append(new Text(c.getIdentifier()), c);
> }
> seqClusterWriter.close();
>
> 3) output
>
> Description: The output files generated by the algorithm, in which the
> results are stored. Directories named "clusters-i" ('i' being a
> non-negative integer) are created. I'm not quite certain, but I believe
> their naming comes from the number of map/reduce tasks involved.
> "part-00000" files are placed in those directories; they hold records
> logically structured as <Text, Cluster>, each of which represents a
> determined cluster in the dataset.
>   
Each iteration produces a new set of clusters and these are stored in a 
"clusters-i" directory. The number of part files in each directory is 
determined by the number of reducers used by the clustering 
implementation. Only KMeans and Dirichlet allow more than one reducer. 
Dirichlet and MeanShift put all these iteration-generated files in a 
separate state directory in the output path. The nomenclature of these 
directories is not standard and I can see that an improvement is needed.
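
Since each iteration writes its own "clusters-i" directory, a reader of the 
output usually wants the one with the highest i. A minimal sketch of 
picking it from a list of directory names follows. LatestClustersDir and 
latestIteration are hypothetical names, the list of names is hard-coded 
(in a real job it would come from listing the output path), and the sketch 
assumes the KMeans "clusters-i" layout rather than the state directory 
used by Dirichlet and MeanShift.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes directories are named "clusters-i".
class LatestClustersDir {

    private static final Pattern CLUSTERS_DIR = Pattern.compile("clusters-(\\d+)");

    // Returns the highest iteration index i among names of the form
    // "clusters-i", or an empty result if no name matches.
    static OptionalInt latestIteration(List<String> dirNames) {
        int best = -1;
        for (String name : dirNames) {
            Matcher m = CLUSTERS_DIR.matcher(name);
            if (m.matches()) {
                // Compare numerically so that clusters-10 sorts after clusters-9.
                best = Math.max(best, Integer.parseInt(m.group(1)));
            }
        }
        return best < 0 ? OptionalInt.empty() : OptionalInt.of(best);
    }

    public static void main(String[] args) {
        // Sample names only; "points" is ignored because it does not match.
        List<String> names = Arrays.asList("clusters-0", "clusters-1", "clusters-10", "points");
        OptionalInt latest = latestIteration(names);
        System.out.println(latest.isPresent()
                ? "clusters-" + latest.getAsInt()
                : "no cluster directories found");
    }
}
```

The numeric comparison matters: a plain string sort would place 
"clusters-10" before "clusters-2".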
> Path: An absolute path to a parent directory for the "clusters-i"
> directories. Example: "output".
> Code example (reading and printing an output file):
>
> // Get FileSystem through Configuration
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // Create a reader for a 'part-00000' file
> Path outPath = new Path("output/clusters-0/part-00000");
> SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPath, conf);
>
> Writable key = (Writable) reader.getKeyClass().newInstance();
> Cluster value = new Cluster();
> Vector center = null;
>
> // Read file's records and print each cluster as 'Cluster: key {center}'
> while (reader.next(key, value)) {
> 	System.out.println("Cluster: " + key + " { ");
> 	center = value.getCenter();
>
> 	for (int i = 0; i < center.size(); i++) {
> 		System.out.print(center.get(i) + " ");
> 	}
> 	System.out.println(" }");
> }
> reader.close();
>
> 4) points
>
> Description: A directory containing a "part-00000" file with
> <VectorID, ClusterID> records (both being Text type fields). It's
> basically an index (with VectorID as key) that matches every Vector
> described in the input ("thedata.dat" in our example) with the cluster
> it now belongs to.
> Logical format: <VectorID, ClusterID>. VectorID matches the ID
> specified by the first field of each record in the input file.
> ClusterID matches the ID in the first field of each "part-xxxxx"
> included in a "clusters-i" directory.
>   
The output points format has recently been changed from <ClusterID, 
Vector.asFormatString> to either <Vector.name, ClusterID> or 
<Vector.asFormatString, ClusterID>, depending on whether the points 
have been named or not.
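
The key selection just described can be sketched as follows. This is a 
self-contained illustration, not the code in Cluster or 
KMeansClusterMapper: PointKeySketch and outputKey are hypothetical, and 
the name and the formatted vector are passed in as plain strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the <key, ClusterID> output records.
class PointKeySketch {

    // Returns the key for one output record: the vector's name if it is
    // non-null, otherwise the vector's formatted string.
    static String outputKey(String name, String formatString) {
        return name != null ? name : formatString;
    }

    public static void main(String[] args) {
        // Sample data only: one named point and one unnamed point.
        Map<String, String> points = new LinkedHashMap<>();
        points.put(outputKey("p1", "[1.0, 2.0]"), "C0");
        points.put(outputKey(null, "[9.0, 8.0]"), "C1");

        // Each entry corresponds to one <key, ClusterID> record.
        for (Map.Entry<String, String> e : points.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A consequence worth noting is that unnamed points with identical 
coordinates produce identical keys, so the key is only a reliable 
identifier when the points are named.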

The "TODO: This is ugly" comment in the Cluster code used for this 
kludge is spot on.
Jeff
