mahout-dev mailing list archives

From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
Date Mon, 10 Jun 2013 20:12:20 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679843#comment-13679843 ]

Jake Mannix commented on MAHOUT-1147:
-------------------------------------

Hmmm:

13/06/10 12:58:44 INFO cvb.CVB0Driver: About to run: Writing final topic/term distributions from /tmp/mahout-work-jake/reuters-lda-model/model-20 to /tmp/mahout-work-jake/reuters-lda
13/06/10 12:58:45 INFO input.FileInputFormat: Total input paths to process : 10
13/06/10 12:58:46 INFO cvb.CVB0Driver: About to run: Writing final document/topic inference from /tmp/mahout-work-jake/reuters-out-matrix/matrix to /tmp/mahout-work-jake/reuters-lda-topics
13/06/10 12:58:47 INFO input.FileInputFormat: Total input paths to process : 1
13/06/10 12:58:52 INFO mapred.JobClient: Running job: job_201306101136_0057
13/06/10 12:58:53 INFO mapred.JobClient:  map 0% reduce 0%
13/06/10 12:59:50 INFO mapred.JobClient:  map 20% reduce 0%
13/06/10 12:59:56 INFO mapred.JobClient:  map 40% reduce 0%
13/06/10 12:59:59 INFO mapred.JobClient:  map 60% reduce 0%
13/06/10 13:00:02 INFO mapred.JobClient:  map 80% reduce 0%
13/06/10 13:00:05 INFO mapred.JobClient:  map 100% reduce 0%
13/06/10 13:00:08 INFO mapred.JobClient: Job complete: job_201306101136_0057
13/06/10 13:00:08 INFO mapred.JobClient: Counters: 6
13/06/10 13:00:08 INFO mapred.JobClient:   Job Counters 
13/06/10 13:00:08 INFO mapred.JobClient:     Launched map tasks=10
13/06/10 13:00:08 INFO mapred.JobClient:     Data-local map tasks=10
13/06/10 13:00:08 INFO mapred.JobClient:   FileSystemCounters
13/06/10 13:00:08 INFO mapred.JobClient:     HDFS_BYTES_READ=6690610
13/06/10 13:00:08 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=6690610
13/06/10 13:00:08 INFO mapred.JobClient:   Map-Reduce Framework
13/06/10 13:00:08 INFO mapred.JobClient:     Map input records=20
13/06/10 13:00:08 INFO mapred.JobClient:     Spilled Records=0
13/06/10 13:00:08 INFO mapred.JobClient: Running job: job_201306101136_0058
13/06/10 13:00:09 INFO mapred.JobClient:  map 0% reduce 0%
13/06/10 13:00:12 INFO mapred.JobClient:  map 100% reduce 0%
13/06/10 13:10:17 INFO mapred.JobClient: Task Id : attempt_201306101136_0058_m_000000_0, Status : FAILED
java.lang.NullPointerException
	at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.cleanup(CVB0DocInferenceMapper.java:99)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

Task attempt_201306101136_0058_m_000000_0 failed to report status for 602 seconds. Killing!
13/06/10 13:10:18 INFO mapred.JobClient:  map 0% reduce 0%
13/06/10 13:10:27 INFO mapred.JobClient:  map 100% reduce 0%

                
> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
> -----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1147
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1147
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Eclipse IDE
> Java code base
> CVB0Driver Class
> setModelPaths(Job job, Path modelPath) - method
>            Reporter: Jack Pay
>            Assignee: Jake Mannix
>              Labels: bug, cvb, fix, suggestion
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> A variety of Job instances call this method.
> The Job is passed to the method instead of the Configuration object given to the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem, as the variable MODEL_PATHS is set on the clone, which is then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore returns null.
> The code stipulates that if it cannot find a model, it uses a new random matrix. This happens every time, as MODEL_PATHS is not set on the Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead, pass the Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far, what little testing I have done suggests that this solves the problem.
>  
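
For readers following along, the copy-versus-original behaviour the reporter describes can be sketched without Hadoop at all. The snippet below is only an illustration under stated assumptions: MiniConf and MiniJob are hypothetical stand-ins for Configuration and Job, the key "demo.model.paths" and the path "/tmp/model-20" are made up, and the assumption (taken from the report) is that a Job keeps its own copy of the Configuration it was built from. It is not the CVB0Driver code and says nothing about whether the attached patch is complete.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopyDemo {
    // Hypothetical property name standing in for MODEL_PATHS.
    static final String MODEL_PATHS = "demo.model.paths";

    /** Hypothetical stand-in for a Configuration: a plain string key/value map. */
    static final class MiniConf {
        private final Map<String, String> props = new HashMap<>();
        MiniConf() {}
        MiniConf(MiniConf other) { props.putAll(other.props); } // copy constructor
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Hypothetical stand-in for a Job; ASSUMPTION: it copies the conf it is given. */
    static final class MiniJob {
        private final MiniConf conf;
        MiniJob(MiniConf original) { this.conf = new MiniConf(original); }
        MiniConf getConfiguration() { return conf; }
    }

    // Pattern described in the report: the property lands on the job's private copy.
    static void setModelPaths(MiniJob job, String modelPath) {
        job.getConfiguration().set(MODEL_PATHS, modelPath);
    }

    // Proposed pattern: the property is set on the caller's own Configuration.
    static void setModelPaths(MiniConf conf, String modelPath) {
        conf.set(MODEL_PATHS, modelPath);
    }

    /** Sets the path through the job, then reads it back from the original conf. */
    static String viaJob() {
        MiniConf conf = new MiniConf();
        MiniJob job = new MiniJob(conf);
        setModelPaths(job, "/tmp/model-20");
        return conf.get(MODEL_PATHS); // the original conf never saw the value
    }

    /** Sets the path on the conf first, then reads it from a job created afterwards. */
    static String viaConf() {
        MiniConf conf = new MiniConf();
        setModelPaths(conf, "/tmp/model-20");
        MiniJob laterJob = new MiniJob(conf);
        return laterJob.getConfiguration().get(MODEL_PATHS);
    }

    public static void main(String[] args) {
        System.out.println("set via Job, read from original conf: " + viaJob());
        System.out.println("set via conf, read from a later job:  " + viaConf());
    }
}
```

In this toy model, the first call reads back null because the value only ever existed on the job's copy, while the second carries the value into any job created from the same Configuration afterwards. Whether the real driver behaves the same way depends on how each Job is constructed there.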

