mahout-dev mailing list archives

From "Philippe Lamarche" <philippe.lamar...@gmail.com>
Subject Re: [jira] Updated: (MAHOUT-99) Improving speed of KMeans
Date Fri, 28 Nov 2008 17:17:40 GMT
Hi,

I just tried the patch and I am having trouble getting it to work. I noticed
that there are two new parameters to KMeansDriver.runJob, and I am probably not
setting them correctly. If I understand correctly, they set the number of
mappers and reducers. How should I set them if I am running Mahout on a
single-node cluster?

This is what I get from the syntheticcontrol example:


hadoop@philippe-vaio:/usr/local/hadoop$ bin/hadoop dfs -put /home/philippe/synthetic_control.data testdata
hadoop@philippe-vaio:/usr/local/hadoop$ bin/hadoop jar /home/philippe/workspace/MahoutJava/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
08/11/28 12:01:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:05 INFO mapred.FileInputFormat: Total input paths to process : 1
08/11/28 12:01:05 INFO mapred.JobClient: Running job: job_200811281146_0008
08/11/28 12:01:06 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:13 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:14 INFO mapred.JobClient: Job complete: job_200811281146_0008
08/11/28 12:01:14 INFO mapred.JobClient: Counters: 7
08/11/28 12:01:14 INFO mapred.JobClient:   File Systems
08/11/28 12:01:14 INFO mapred.JobClient:     HDFS bytes read=291644
08/11/28 12:01:14 INFO mapred.JobClient:     HDFS bytes written=323660
08/11/28 12:01:14 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:14 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:14 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:14 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:14 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:14 INFO mapred.JobClient:     Map input bytes=288374
08/11/28 12:01:14 INFO mapred.JobClient:     Map output records=600
08/11/28 12:01:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:14 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:15 INFO mapred.JobClient: Running job: job_200811281146_0009
08/11/28 12:01:16 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:21 INFO mapred.JobClient:  map 50% reduce 0%
08/11/28 12:01:23 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:27 INFO mapred.JobClient:  map 100% reduce 100%
08/11/28 12:01:28 INFO mapred.JobClient: Job complete: job_200811281146_0009
08/11/28 12:01:28 INFO mapred.JobClient: Counters: 16
08/11/28 12:01:28 INFO mapred.JobClient:   File Systems
08/11/28 12:01:28 INFO mapred.JobClient:     HDFS bytes read=323660
08/11/28 12:01:28 INFO mapred.JobClient:     HDFS bytes written=9657
08/11/28 12:01:28 INFO mapred.JobClient:     Local bytes read=36119
08/11/28 12:01:28 INFO mapred.JobClient:     Local bytes written=72300
08/11/28 12:01:28 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:28 INFO mapred.JobClient:     Launched reduce tasks=1
08/11/28 12:01:28 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:28 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:28 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce input groups=1
08/11/28 12:01:28 INFO mapred.JobClient:     Combine output records=28
08/11/28 12:01:28 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce output records=7
08/11/28 12:01:28 INFO mapred.JobClient:     Map output bytes=943020
08/11/28 12:01:28 INFO mapred.JobClient:     Map input bytes=323660
08/11/28 12:01:28 INFO mapred.JobClient:     Combine input records=1732
08/11/28 12:01:28 INFO mapred.JobClient:     Map output records=1732
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce input records=28
08/11/28 12:01:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:28 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:29 INFO mapred.JobClient: Running job: job_200811281146_0010
08/11/28 12:01:30 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:35 INFO mapred.JobClient:  map 50% reduce 0%
08/11/28 12:01:37 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:41 INFO mapred.JobClient:  map 100% reduce 100%
08/11/28 12:01:42 INFO mapred.JobClient: Job complete: job_200811281146_0010
08/11/28 12:01:42 INFO mapred.JobClient: Counters: 16
08/11/28 12:01:42 INFO mapred.JobClient:   File Systems
08/11/28 12:01:42 INFO mapred.JobClient:     HDFS bytes read=342974
08/11/28 12:01:42 INFO mapred.JobClient:     HDFS bytes written=3002539
08/11/28 12:01:42 INFO mapred.JobClient:     Local bytes read=3018455
08/11/28 12:01:42 INFO mapred.JobClient:     Local bytes written=6036972
08/11/28 12:01:42 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:42 INFO mapred.JobClient:     Launched reduce tasks=1
08/11/28 12:01:42 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:42 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:42 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce input groups=7
08/11/28 12:01:42 INFO mapred.JobClient:     Combine output records=0
08/11/28 12:01:42 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce output records=1591
08/11/28 12:01:42 INFO mapred.JobClient:     Map output bytes=3008903
08/11/28 12:01:42 INFO mapred.JobClient:     Map input bytes=323660
08/11/28 12:01:42 INFO mapred.JobClient:     Combine input records=0
08/11/28 12:01:42 INFO mapred.JobClient:     Map output records=1591
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce input records=1591
08/11/28 12:01:42 INFO kmeans.KMeansDriver: Iteration 0
08/11/28 12:01:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/11/28 12:01:42 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:42 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:42 INFO mapred.JobClient: Running job: job_local_0001
08/11/28 12:01:42 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:42 INFO mapred.MapTask: numReduceTasks: 1
08/11/28 12:01:42 INFO mapred.MapTask: io.sort.mb = 100
08/11/28 12:01:42 INFO mapred.MapTask: data buffer = 79691776/99614720
08/11/28 12:01:42 INFO mapred.MapTask: record buffer = 262144/327680
08/11/28 12:01:42 WARN mapred.LocalJobRunner: job_local_0001
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1938)
    at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
    at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:64)
    at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
08/11/28 12:01:43 WARN kmeans.KMeansDriver: java.io.IOException: Job failed!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:129)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:80)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:80)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
08/11/28 12:01:43 INFO kmeans.KMeansDriver: Clustering
08/11/28 12:01:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:43 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:44 INFO mapred.JobClient: Running job: job_200811281146_0011
08/11/28 12:01:45 INFO mapred.JobClient:  map 0% reduce 0%
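
A note on the trace above: "String index out of range: -1" is raised by String.substring, and one common cause is passing it the result of an indexOf call that found nothing. Whether that is what happens at Cluster.java:81 cannot be told from the log alone. The following is a minimal Java sketch of that failure mode and a guarded alternative. The "id[values]" text format and the method names are assumptions made up for the illustration; this is not Mahout's decodeCluster.

```java
// Hypothetical illustration only: NOT Mahout's Cluster.decodeCluster.
// The "id[v1, v2]" text format is an assumption made for this example.
final class DecodeSketch {

    // Naive parse: if '[' is missing, indexOf returns -1 and
    // substring(0, -1) throws StringIndexOutOfBoundsException.
    static String naiveId(String encoded) {
        int open = encoded.indexOf('[');
        return encoded.substring(0, open);
    }

    // Guarded parse: check the delimiter first and report the offending input,
    // which makes an empty or unexpected value much easier to diagnose.
    static String guardedId(String encoded) {
        int open = encoded.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException(
                "Not an encoded cluster: '" + encoded + "'");
        }
        return encoded.substring(0, open);
    }

    public static void main(String[] args) {
        System.out.println(guardedId("C0[1.0, 2.0]"));
        try {
            naiveId(""); // e.g. an empty value read from the job configuration
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive parse failed: " + e.getMessage());
        }
    }
}
```

If the real cause is similar, logging the string handed to the decoder just before the failing call would show whether it is empty or in an unexpected format.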


Thanks!
Philippe.

On Fri, Nov 28, 2008 at 8:05 AM, Pallavi Palleti (JIRA) <jira@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Pallavi Palleti updated MAHOUT-99:
> ----------------------------------
>
>     Attachment: MAHOUT-99.patch
>
> This patch takes care of the issues with speed. It also handles the case
> where the combiner runs zero times or more than once.
>
> > Improving speed of KMeans
> > -------------------------
> >
> >                 Key: MAHOUT-99
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Clustering
> >            Reporter: Pallavi Palleti
> >         Attachments: MAHOUT-99.patch
> >
> >
> > Improved the speed of KMeans by passing only the cluster ID from mapper to
> > reducer. Previously, the whole cluster info was sent as a formatted string.
> > Also removed the implicit assumption that the combiner runs exactly once;
> > the code is modified so that it no longer produces a bug when the combiner
> > runs zero times or more than once.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
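
The combiner point in the quoted issue can be illustrated with a small Java sketch. Hadoop may apply a combiner zero, one, or several times, so the values it emits have to be of the same kind as the values it consumes, and merging them must not depend on how they were grouped. Carrying a (count, sum) pair per cluster has that property. The class and method names below are hypothetical and this is not code from MAHOUT-99.patch; it only shows the idea under those assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not taken from MAHOUT-99.patch.
final class PartialSum {
    final long count;   // number of points folded into this partial result
    final double[] sum; // component-wise sum of those points

    PartialSum(long count, double[] sum) {
        this.count = count;
        this.sum = sum;
    }

    // What a mapper would emit for one point assigned to a cluster.
    static PartialSum ofPoint(double[] point) {
        return new PartialSum(1, point.clone());
    }

    // Used by both combiner and reducer: input and output have the same type,
    // so it can be applied any number of times (up to floating-point rounding).
    static PartialSum merge(List<PartialSum> parts) {
        double[] total = new double[parts.get(0).sum.length];
        long n = 0;
        for (PartialSum p : parts) {
            n += p.count;
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
        }
        return new PartialSum(n, total);
    }

    // Final step in the reducer only: divide the sum by the count.
    double[] centroid() {
        double[] c = new double[sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = sum[i] / count;
        }
        return c;
    }

    public static void main(String[] args) {
        PartialSum a = ofPoint(new double[] {1, 2});
        PartialSum b = ofPoint(new double[] {3, 4});
        PartialSum c = ofPoint(new double[] {5, 6});
        PartialSum d = ofPoint(new double[] {7, 0});

        // Combiner never runs: the reducer sees every mapper output.
        double[] direct = merge(Arrays.asList(a, b, c, d)).centroid();

        // Combiner runs once on part of the data before the reducer.
        PartialSum ab = merge(Arrays.asList(a, b));
        double[] combined = merge(Arrays.asList(ab, c, d)).centroid();

        System.out.println(Arrays.toString(direct));
        System.out.println(Arrays.toString(combined));
    }
}
```

Averaging inside the combiner instead would give wrong centroids as soon as the combiner ran on unevenly sized groups, which is why the division is left to the reducer in this sketch.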
