mahout-user mailing list archives

From "antonio d'agata" <antoniodag...@gmail.com>
Subject Re: LDA clustering documentation (mahout-07-snapshot)
Date Fri, 13 Apr 2012 10:54:15 GMT
Hi Jake,

I didn't understand what you meant before.

I tried the cvb command as follows (N.B. for now I'm working without Hadoop):

mahout cvb -i DB-vectors/tfidf-vectors -dict DB-vectors/dictionary.file-0 -o DB-CVB-output -dt DB-CVB-document -k 50 -mt DB-CVB-states -x 10 -tf 0.2

but it gives me the error:

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/antoniodagata/mahout-distribution-07/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/antoniodagata/mahout-distribution-07/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/04/13 12:48:46 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
12/04/13 12:48:46 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[DB-vectors/dictionary.file-0], --doc_topic_output=[DB-CVB-document], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[DB-vectors/tfidf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[10], --num_topics=[50], --num_train_threads=[4], --num_update_threads=[1], --output=[DB-CVB-output], --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0.2], --topic_model_temp_dir=[DB-CVB-states]}
12/04/13 12:48:46 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on DB-vectors/tfidf-vectors (numTerms: 8783), finding 50-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations to run will be 10, unless the change in perplexity is less than 0.0.  Topic model output (p(term|topic) for each topic) will be stored DB-CVB-output.  Random initialization seed is 3000, holding out 0.2 of the data for perplexity check

12/04/13 12:48:46 INFO cvb.CVB0Driver: Dictionary to be used located DB-vectors/dictionary.file-0
p(topic|docId) will be stored DB-CVB-document

12/04/13 12:48:46 INFO cvb.CVB0Driver: Current iteration number: 0
12/04/13 12:48:46 INFO cvb.CVB0Driver: About to run iteration 1 of 10
12/04/13 12:48:46 INFO cvb.CVB0Driver: About to run: Iteration 1 of 10, input path: DB-CVB-states/model-0
12/04/13 12:48:46 INFO input.FileInputFormat: Total input paths to process : 1
12/04/13 12:48:47 INFO mapred.JobClient: Running job: job_local_0001
12/04/13 12:48:47 INFO mapred.MapTask: io.sort.mb = 100
12/04/13 12:48:47 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/13 12:48:47 INFO mapred.MapTask: record buffer = 262144/327680
12/04/13 12:48:47 INFO cvb.CachingCVB0Mapper: Retrieving configuration
12/04/13 12:48:47 INFO cvb.CachingCVB0Mapper: Initializing read model
12/04/13 12:48:47 INFO cvb.CachingCVB0Mapper: No model files found
12/04/13 12:48:47 INFO cvb.CachingCVB0Mapper: Initializing write model
12/04/13 12:48:47 INFO cvb.CachingCVB0Mapper: Initializing model trainer
12/04/13 12:48:47 INFO cvb.ModelTrainer: Starting training threadpool with 4 threads
12/04/13 12:48:47 WARN mapred.LocalJobRunner: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/04/13 12:48:48 INFO mapred.JobClient:  map 0% reduce 0%
12/04/13 12:48:48 INFO mapred.JobClient: Job complete: job_local_0001
12/04/13 12:48:48 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Failed to complete iteration 1 stage 1
    at org.apache.mahout.clustering.lda.cvb.CVB0Driver.runIteration(CVB0Driver.java:518)
    at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:304)
    at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:187)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.lda.cvb.CVB0Driver.main(CVB0Driver.java:550)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
12/04/13 12:48:53 INFO mapred.LocalJobRunner:
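
The ClassCastException above says the mapper received a Text key where it expected an IntWritable, which suggests the keys stored in DB-vectors/tfidf-vectors are Text. A minimal Python sketch for checking that, assuming the usual SequenceFile header layout (the magic bytes SEQ, one version byte, then the key and value class names as length-prefixed strings) and a header version of 4 or later. The function names here are hypothetical and are not part of Mahout or Hadoop:

```python
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer from a binary stream."""
    first = struct.unpack("b", stream.read(1))[0]
    if first >= -112:
        # Values in -112..127 are stored directly in a single byte.
        return first
    negative = first < -120
    # Number of payload bytes that follow the marker byte.
    extra = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(extra), "big")
    return ~value if negative else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string (vint length, then the bytes)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def sequence_file_classes(stream):
    """Return (key_class, value_class) from a SequenceFile header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a Hadoop SequenceFile")
    version = stream.read(1)[0]
    if version < 4:
        # Assumption: older headers store class names differently.
        raise ValueError("unsupported SequenceFile version %d" % version)
    key_class = read_text(stream)
    value_class = read_text(stream)
    return key_class, value_class


def describe(path):
    """Open a local SequenceFile and return its key and value class names."""
    with open(path, "rb") as handle:
        return sequence_file_classes(handle)
```

Calling describe() on one of the part files inside DB-vectors/tfidf-vectors (the exact part-file name depends on how the vectors were written, so it is left out here) would show whether the key class is org.apache.hadoop.io.Text or org.apache.hadoop.io.IntWritable.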


I've also tried the other command:

mahout cvb0_local -i DB-vectors/tfidf-vectors -top 50 -m 10 -b 10 -d DB-vectors/dictionary.file-0 -rdt -do DB-CVB-document-mem -to DB-CVB-topic-mem

but in this case the error is:

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
no HADOOP_HOME set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/antoniodagata/mahout-distribution-07/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/antoniodagata/mahout-distribution-07/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/04/13 12:51:28 WARN driver.MahoutDriver: No cvb0_local.props found on classpath, will use command-line arguments only
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
    at org.apache.mahout.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.main2(InMemoryCollapsedVariationalBayes0.java:369)
    at org.apache.mahout.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.run(InMemoryCollapsedVariationalBayes0.java:498)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.main(InMemoryCollapsedVariationalBayes0.java:502)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)



2012/4/12 Jake Mannix <jake.mannix@gmail.com>

> Hi Antonio,
>
>  Are you using the new LDA (invoked via "$MAHOUT_HOME/bin/mahout cvb
> <args>",
> or by invoking the class org.apache.mahout.clustering.lda.cvb.CVB0Driver
> manually)?
>
>  If so, then your first command should work fine:
>
> mahout vectordump -i DB-LDA-clusters/docTopics/part-m-00000
> -o output/cluster_lda_topics.txt
>
>   What error do you get?
>
> On Thu, Apr 12, 2012 at 6:21 AM, antonio d'agata <antoniodagata@gmail.com> wrote:
>
> > Dear users,
> >
> > I'm trying to use the LDA clustering algorithm from the command line
> > (using mahout-07-snapshot). I was able to get the topics (as a text file
> > containing the top words), but I also need to get the document IDs
> > associated with the calculated topics.
> >
> > I tried these commands:
> > mahout vectordump -i DB-LDA-clusters/docTopics/part-m-00000 -o
> > output/cluster_lda_topics.txt
> > mahout vectordump -i DB-LDA-clusters/docTopics/part-m-00000 -o
> > output/cluster_lda_topics.txt -dt text(or sequencefile)
> > but without success.
> >
> > Is there a way to do this?
> >
> > Thanks
> >
> > Antonio Michelangelo D'Agata
> >
>
>
>
> --
>
>  -jake
>
