You can modify the code in CachingCVB0Mapper.map, CachingCVB0PerplexityMapper.map,
and CVB0DocInferenceMapper.map to read
SequenceFile<WritableComparable<?>, VectorWritable> instead, and then
convert the key to an Integer.
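
A minimal, hypothetical sketch of that change (this is not the actual Mahout
mapper code; the class name and the pass-through body are made up, only the
key normalization is the point):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    // Accepts any WritableComparable key and normalizes it to an int doc id,
    // instead of requiring IntWritable up front.
    public class GenericKeyCVB0Mapper
        extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

      private final IntWritable outKey = new IntWritable();

      @Override
      protected void map(WritableComparable<?> key, VectorWritable doc, Context ctx)
          throws IOException, InterruptedException {
        int docId;
        if (key instanceof IntWritable) {
          // Already the format cvb expects.
          docId = ((IntWritable) key).get();
        } else {
          // Assumes the key (e.g. a Text doc title) is numeric, as noted in the
          // quoted message below; otherwise a separate id mapping would be needed.
          docId = Integer.parseInt(key.toString().trim());
        }
        outKey.set(docId);
        // The real CVB0 mappers run topic inference here; this sketch only
        // forwards the document vector keyed by its numeric id.
        ctx.write(outKey, doc);
      }
    }
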
2012/5/5 chenghao liu <twinsken@gmail.com>:
> The title of the doc, which is the key of the SequenceFile, needs to be a number.
>
> 2012/5/5 Jake Mannix <jake.mannix@gmail.com>:
>> I'm about to head to bed right now (long day, flight to and from sf in one
>> day, need sleep), but short answer is
>> that the new LDA requires SequenceFile<IntWritable, VectorWritable> as
>> input (the same disk format
>> as DistributedRowMatrix), which you can get out of SequenceFile<Text,
>> VectorWritable> by running the
>> RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h" for more details) before
>> running CVB.
>>
>> Let us know if that doesn't help!
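
For example, a rough sketch of how that rowid step could be spliced into Dan's
script below (the rowid output directory name here is made up, and the "matrix"
file name under it is my assumption; run "mahout rowid -h" to confirm the exact
options and output layout):

    $MAHOUT rowid \
      -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
      -o ${WORK_DIR}/reuters-rowid \
    && \
    $MAHOUT cvb \
      -i ${WORK_DIR}/reuters-rowid/matrix \
      -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
      -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
      -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
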
>>
>> On Fri, May 4, 2012 at 8:54 PM, DAN HELM <danielhelm@verizon.net> wrote:
>>
>>> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
>>> against the Reuters data. I just added another
>>> entry to the cluster-reuters.sh example script as follows:
>>>
>>> ******************************************************************************
>>> elif [ "x$clustertype" == "xcvb" ]; then
>>> $MAHOUT seq2sparse \
>>> -i ${WORK_DIR}/reuters-out-seqdir/ \
>>> -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>>> -wt tf -seq -nr 3 --namedVector \
>>> && \
>>> $MAHOUT cvb \
>>> -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>>> -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>>> -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>> -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>>> && \
>>> $MAHOUT ldatopics \
>>> -i ${WORK_DIR}/reuters-cvb/state-2 \
>>> -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>>> -dt sequencefile
>>>
>>> ******************************************************************************
>>> I successfully ran the previous LDA algorithm against Reuters but I am
>>> most interested in this new implementation of LDA because I want the new
>>> feature that generates document-to-cluster mappings (via the -dt parameter).
>>>
>>> When I run the above script in Hadoop pseudo-distributed mode, as well as on
>>> a small cluster, I receive the same error from the "mahout cvb" command.
>>> All the pre-clustering logic, including sequence file and sparse vector
>>> generation, works fine, but when the cvb clustering is attempted the mappers
>>> fail with the following error in the Hadoop map task log:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.hadoop.io.IntWritable
>>> at
>>> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> Any help with resolving the problem would be appreciated.
>>>
>>> Dan
>>
>>
>>
>>
>> --
>>
>> -jake