mahout-user mailing list archives

From chenghao liu <twins...@gmail.com>
Subject Re: Problem running new LDA algorithm (cvb) against the Reuters data
Date Sat, 05 May 2012 06:15:52 GMT
You can modify the code in CachingCVB0Mapper.map,
CachingCVB0PerplexityMapper.map, and CVB0DocInferenceMapper.map to read
SequenceFile<WritableComparable<?>,VectorWritable> instead, and then
convert the key to an integer.
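For reference, the change described above might look roughly like the
following. This is only a sketch against the Mahout 0.6 source (the class
name and line number come from the stack trace in the quoted message), it
is untested, and it assumes the Text key of each document actually holds a
numeric id:

```java
// Sketch only: relax the input key type of CachingCVB0Mapper so it accepts
// any WritableComparable key instead of requiring IntWritable.
public class CachingCVB0Mapper
    extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  @Override
  protected void map(WritableComparable<?> key, VectorWritable doc, Context context)
      throws IOException, InterruptedException {
    // Convert the key (e.g. a Text document title) to the integer doc id
    // the trainer expects; this only works if the title is itself a number.
    int docId = Integer.parseInt(key.toString());
    // ... rest of the original map() body, using docId ...
  }
}
```

The same signature change would have to be repeated in the other two
mappers named above; the supported route, per the reply quoted below, is
to run RowIdJob instead of patching Mahout.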

2012/5/5 chenghao liu <twinsken@gmail.com>:
> the title of the doc, which is the key of the SequenceFile, needs to be a number
>
> 2012/5/5 Jake Mannix <jake.mannix@gmail.com>:
>> I'm about to head to bed right now (long day, flight to and from sf in one
>> day, need sleep), but short answer is
>> that the new LDA requires SequenceFile<IntWritable, VectorWritable> as
>> input (the same disk format
>> as DistributedRowMatrix), which you can get out of SequenceFile<Text,
>> VectorWritable> by running the
>> RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h" for more details) before
>> running CVB.
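Concretely, that extra step might slot into the script from the original
message like this. This is a sketch, not tested: the output path name is
an assumption, and $MAHOUT/$WORK_DIR are assumed to be set as in
cluster-reuters.sh. (RowIdJob writes its IntWritable-keyed vectors under a
"matrix" file in the output directory, so cvb's -i must point there.)

```shell
# Sketch: convert the <Text, VectorWritable> tf-vectors into the
# <IntWritable, VectorWritable> matrix format that cvb expects.
$MAHOUT rowid \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
  -o ${WORK_DIR}/reuters-cvb-matrix

# Then point cvb at the converted matrix instead of tf-vectors:
#   $MAHOUT cvb -i ${WORK_DIR}/reuters-cvb-matrix/matrix ...
```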
>>
>> Let us know if that doesn't help!
>>
>> On Fri, May 4, 2012 at 8:54 PM, DAN HELM <danielhelm@verizon.net> wrote:
>>
>>> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
>>> against the Reuters data.   I just added another
>>> entry to the cluster-reuters.sh example script as follows:
>>>
>>> ******************************************************************************
>>> elif [ "x$clustertype" == "xcvb" ]; then
>>>   $MAHOUT seq2sparse \
>>>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>>>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>>>     -wt tf -seq -nr 3 --namedVector \
>>>   && \
>>>   $MAHOUT cvb \
>>>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>>>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>>>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>>>   && \
>>>   $MAHOUT ldatopics \
>>>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>>>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>>     -dt sequencefile
>>>
>>> ******************************************************************************
>>> I successfully ran the previous LDA algorithm against Reuters but I am
>>> most interested in this new implementation of LDA because I want the new
>>> feature that generates document-to-cluster mappings (e.g., parameter -dt).
>>>
>>> When I run the above code via Hadoop pseudo distributed mode as well as on
>>> a small cluster I receive the same error from the "mahout cvb" command.
>>> All the pre-clustering logic including sequence file and sparse vector
>>> generation works fine but when the cvb clustering is attempted the mappers
>>> fail with the following error in the Hadoop map task log:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.hadoop.io.IntWritable
>>>  at
>>> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> Any help with resolving the problem would be appreciated.
>>>
>>> Dan
>>
>>
>>
>>
>> --
>>
>>  -jake
