mahout-user mailing list archives

From eric skinner <ericfrankskin...@gmail.com>
Subject reason for getting clustering result like 0 belongs to cluster 1.0: [ ]
Date Tue, 09 Aug 2011 17:50:31 GMT
I ran NewsKMeansClustering.java (an example given in chapter 9 of
Mahout in Action) against a set of sequence files. However, the generated
result looks like this:

0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []
0 belongs to cluster 1.0: []

The core clustering code in this program is:

    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
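
To help rule out the canopy step producing just a single centroid, I am also
thinking of counting the centroids written to clusters-0 with something like
the sketch below (the Canopy value class and the part-r-00000 file name are
guesses on my part):

    // Count the canopy centroids that seed k-means; a single centroid
    // would explain every point landing in one cluster. The part file
    // name below is an assumption.
    SequenceFile.Reader canopyReader = new SequenceFile.Reader(fs,
        new Path(canopyCentroids, "clusters-0/part-r-00000"), conf);
    Text canopyKey = new Text();
    Canopy canopy = new Canopy();
    int numCanopies = 0;
    while (canopyReader.next(canopyKey, canopy)) {
      numCanopies++;
    }
    canopyReader.close();
    System.out.println(numCanopies + " canopy centroids generated");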

Could you let me know why I get this type of result, where every point appears
to land in a single cluster and prints an empty vector? Is it caused by a
specific parameter setting, or by something else?

The whole program is included below for reference:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class NewsKMeansClustering {

  public static void main(String[] args) throws Exception {

    int minSupport = 5;
    int minDf = 5;
    int maxDFPercent = 95;
    int maxNGramSize = 2;
    int minLLRValue = 50;
    int reduceTasks = 1;
    int chunkSize = 200;
    int norm = 2;
    boolean sequentialAccessOutput = true;

    // String inputDir = "inputDir";
    String inputDir = "sequenceInputDir";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    /*
     * SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
     *     new Path(inputDir, "documents.seq"), Text.class, Text.class);
     * for (Document d : Database) {
     *   writer.append(new Text(d.getID()), new Text(d.contents()));
     * }
     * writer.close();
     */

    String outputDir = "newsClusters";
    HadoopUtil.delete(conf, new Path(outputDir));

    // Tokenize the raw documents.
    Path tokenizedPath = new Path(outputDir,
        DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    MyAnalyzer analyzer = new MyAnalyzer();
    DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);

    // Build term-frequency vectors from the tokenized documents.
    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue, 2,
        true, reduceTasks, chunkSize, sequentialAccessOutput, false);

    // Convert the term-frequency vectors to TF-IDF.
    TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSize, minDf,
        maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);

    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");

    // Generate initial centroids with canopy clustering, then run k-means
    // seeded with those centroids.
    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);

    // Read back the clustered points; the key is the cluster id and the
    // value holds each point's weight and vector.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        // new Path(clusterOutput + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
        new Path(clusterOutput + "/clusteredPoints" + "/part-m-00000"), conf);

    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(key.toString() + " belongs to cluster " + value.toString());
    }
    reader.close();
  }
}
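
For completeness, I also plan to check that the TF-IDF vectors going into
clustering are non-empty, along these lines (again, the part-r-00000 file
name is an assumption on my part):

    // Sanity check: print the first few TF-IDF vectors to confirm they
    // actually contain non-zero terms before clustering.
    SequenceFile.Reader vectorReader = new SequenceFile.Reader(fs,
        new Path(vectorsFolder, "part-r-00000"), conf);
    Text docId = new Text();
    VectorWritable vector = new VectorWritable();
    int shown = 0;
    while (vectorReader.next(docId, vector) && shown < 5) {
      System.out.println(docId + ": "
          + vector.get().getNumNondefaultElements() + " non-zero terms");
      shown++;
    }
    vectorReader.close();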
