mahout-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
Date Mon, 10 Jun 2013 20:20:22 GMT


Grant Ingersoll commented on MAHOUT-1147:

Do you see:
    echo "Extracting Reuters"
    $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out
    if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
        echo "Copying Reuters data to Hadoop"
        set +e
        $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
        $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
        set -e
        $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
        $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out

Also, I'm on #mahout on IRC if that helps us resolve this faster.
> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix
> -----------------------------------------------------------------------------------
>                 Key: MAHOUT-1147
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Eclipse IDE
> Java code base
> CVB0Driver Class
> setModelPaths(Job job, Path modelPath) - method
>            Reporter: Jack Pay
>            Assignee: Jake Mannix
>              Labels: bug, cvb, fix, suggestion
>             Fix For: 0.8
>         Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> Problem:
> When training the doc/topic model, no paths for the term/topic model are found (outputs null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> A variety of Job instances call this method.
> The Job is passed to the method instead of the Configuration object given to the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore returns null.
> The code stipulates that if it cannot find a model, it uses a new random matrix. This happens every time, as MODEL_PATHS is not set for the Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead, pass the Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far, the little testing I have done suggests that this solves the problem.
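
The copy semantics the reporter describes can be illustrated with a small, self-contained sketch. It does not use Hadoop or Mahout: `Conf` and `Job` are hypothetical stand-ins that only mimic the reported behaviour (a Job keeping its own copy of the Configuration it was built from), and the `model.paths` key and the `String` path argument are assumptions made for illustration, not the real CVB0Driver signature.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigCopyDemo {
    // Hypothetical stand-in for a Configuration: a plain string map.
    static class Conf {
        private final Map<String, String> props = new HashMap<>();
        Conf() {}
        // Copy constructor: later changes to the copy do not reach the source.
        Conf(Conf other) { props.putAll(other.props); }
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    // Hypothetical stand-in for a Job that copies the Conf it is given.
    static class Job {
        private final Conf conf;
        Job(Conf original) { this.conf = new Conf(original); }
        Conf getConfiguration() { return conf; }
    }

    // Assumed key name, for illustration only.
    static final String MODEL_PATHS = "model.paths";

    // Pattern described in the report: the value lands on the Job's copy.
    static void setModelPaths(Job job, String modelPath) {
        job.getConfiguration().set(MODEL_PATHS, modelPath);
    }

    // Proposed pattern: the value is set on the Conf the caller holds.
    static void setModelPaths(Conf conf, String modelPath) {
        conf.set(MODEL_PATHS, modelPath);
    }

    public static void main(String[] args) {
        Conf original = new Conf();
        Job job = new Job(original);

        setModelPaths(job, "/tmp/model");
        System.out.println(original.get(MODEL_PATHS));               // null
        System.out.println(job.getConfiguration().get(MODEL_PATHS)); // /tmp/model

        // Setting the value on the original makes it visible to Jobs created afterwards.
        setModelPaths(original, "/tmp/model");
        Job next = new Job(original);
        System.out.println(next.getConfiguration().get(MODEL_PATHS)); // /tmp/model
    }
}
```

In this sketch the first overload leaves `original` without the key, which matches the null lookup reported above; whether the real driver fails for exactly this reason would still need to be confirmed against the actual code.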
