mahout-dev mailing list archives

From "Andrey Davydov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAHOUT-1128) MAHOUT-999 issue still actual
Date Mon, 17 Dec 2012 15:00:29 GMT
Andrey Davydov created MAHOUT-1128:
--------------------------------------

             Summary:  MAHOUT-999 issue still actual
                 Key: MAHOUT-1128
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1128
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.7
         Environment: Hadoop 1.0.3 cluster deployed on Amazon EC2 virtual machines
with Ubuntu 11 and mahout-core.jar 0.7 from Maven Central.
I run my application from a separate "client" machine, which submits tasks to the cluster.


            Reporter: Andrey Davydov


I'm sorry, my English is not good and I'm new to Mahout, but it seems that the MAHOUT-999
issue is still present.

I use mahout-core 0.7 from Maven Central and I get the same failure.

I've investigated the sources and found the following in the org.apache.mahout.clustering.classify.ClusterClassifier
class:

  public void writeToSeqFiles(Path path) throws IOException {
    writePolicy(policy, path);
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(path.toUri(), config);
    SequenceFile.Writer writer = null;
    ClusterWritable cw = new ClusterWritable();
    for (int i = 0; i < models.size(); i++) {
...
      } finally {
        Closeables.closeQuietly(writer);
      }
    }
  }
  
  public void readFromSeqFiles(Configuration conf, Path path) throws IOException {
    Configuration config = new Configuration();
    List<Cluster> clusters = Lists.newArrayList();
    for (ClusterWritable cw : new SequenceFileDirValueIterable<ClusterWritable>(path, PathType.LIST,
        PathFilters.logsCRCFilter(), config)) {
...
    }
    this.models = clusters;
    modelClass = models.get(0).getClass().getName();
    this.policy = readPolicy(path);
  }

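Both methods create a new default Configuration, so they try to work with the local file system
(readFromSeqFiles passes this new config, not its conf parameter, to the iterator). That is,
KMeansDriver writes the initial clusters to the local file system of the "client" machine, and
CIMapper tries to read them from the local file system of a cluster node.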

It seems that the current implementation can work only on a pseudo-distributed Hadoop setup. I think
that ClusterClassifier should store intermediate results in HDFS, using the Configuration passed
in by the user through the API.
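
The mismatch described above can be illustrated with a small stand-alone model. This is only a sketch and is not Hadoop code: it assumes that a scheme-less path is qualified against the default file system named by the "fs.default.name" property (the Hadoop 1.x key), falling back to file:/// when the property is unset. The class DefaultFsModel, the method qualify, the host "namenode:8020" and the example path are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Simplified model only: this does NOT use or reproduce Hadoop classes.
class DefaultFsModel {
    // Hadoop 1.x property naming the default file system (assumed here).
    static final String KEY = "fs.default.name";
    // Fallback when the property is not set: the local file system.
    static final String LOCAL_DEFAULT = "file:///";

    // Qualify a path against the default file system found in conf.
    static URI qualify(String path, Map<String, String> conf) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p; // already fully qualified, nothing to resolve
        }
        String base = conf.getOrDefault(KEY, LOCAL_DEFAULT);
        return URI.create(base).resolve(p);
    }

    public static void main(String[] args) {
        // Stands in for "new Configuration()" on a client without cluster settings.
        Map<String, String> fresh = new HashMap<>();
        // Stands in for the Configuration the user passes through the API.
        Map<String, String> user = new HashMap<>();
        user.put(KEY, "hdfs://namenode:8020/"); // hypothetical namenode address

        String path = "/tmp/clusters-0/part-00000"; // hypothetical path
        System.out.println("fresh config -> " + qualify(path, fresh));
        System.out.println("user config  -> " + qualify(path, user));
    }
}
```

In this model the same path lands on the local file system with the fresh settings and on HDFS with the user's settings, which is why passing the caller's Configuration into writeToSeqFiles and readFromSeqFiles (instead of creating a new one) would be expected to make both sides agree.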

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
