mahout-user mailing list archives

From Kate Ericson <moving...@gmail.com>
Subject Dirichlet Clustering with Reuters dataset
Date Mon, 07 Feb 2011 21:07:42 GMT

Hi all,

I'm trying to run Dirichlet clustering on the Reuters dataset. I've
created the sparse vectors from the dataset and have been running
k-means on them, but when I try to run dirichlet on the same
tfidf-vectors/ directory I get a JSON parsing error (a
JsonParseException caused by a StackOverflowError).
I'm using Mahout 0.4 and running this locally. The command is:

bin/mahout dirichlet \
  -i ../reuters-normalized-bigram/tfidf-vectors/ \
  -o reuters-dirichlet-clusters \
  -k 60 -x 5 -a0 1.0 \
  -md org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution \
  -mp org.apache.mahout.math.SequentialAccessSparseVector \
  -ow

The full output is pasted below my signature (it's very long).

Any help would be appreciated.

-Kate

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/conf
11/02/07 14:03:51 INFO common.AbstractJob: Command line arguments:
{--alpha=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--emitMostLikely=true, --endPhase=2147483647,
--input=../reuters-normalized-bigram/tfidf-vectors/, --maxIter=5,
--method=mapreduce,
--modelDist=org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution,
--modelPrototype=org.apache.mahout.math.SequentialAccessSparseVector,
--numClusters=60, --output=reuters-dirichlet-clusters,
--overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
11/02/07 14:03:51 INFO common.HadoopUtil: Deleting reuters-dirichlet-clusters
11/02/07 14:03:52 INFO dirichlet.DirichletDriver: Iteration 1
11/02/07 14:03:52 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/02/07 14:03:53 INFO input.FileInputFormat: Total input paths to process : 1
11/02/07 14:03:53 INFO mapred.JobClient: Running job: job_local_0001
11/02/07 14:03:53 INFO input.FileInputFormat: Total input paths to process : 1
11/02/07 14:03:53 INFO mapred.MapTask: io.sort.mb = 100
11/02/07 14:03:53 INFO mapred.MapTask: data buffer = 79691776/99614720
11/02/07 14:03:53 INFO mapred.MapTask: record buffer = 262144/327680
11/02/07 14:03:53 WARN mapred.LocalJobRunner: job_local_0001
com.google.gson.JsonParseException: Failed parsing JSON source:
java.io.StringReader@9ac5f13 to Json
	at com.google.gson.JsonParser.parse(JsonParser.java:59)
	at com.google.gson.Gson.fromJson(Gson.java:376)
	at com.google.gson.Gson.fromJson(Gson.java:329)
	at com.google.gson.Gson.fromJson(Gson.java:305)
	at org.apache.mahout.math.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:63)
	at org.apache.mahout.math.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:32)
	at com.google.gson.JsonDeserializerExceptionWrapper.deserialize(JsonDeserializerExceptionWrapper.java:50)
	at com.google.gson.JsonObjectDeserializationVisitor.visitFieldUsingCustomHandler(JsonObjectDeserializationVisitor.java:115)
	at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
	at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
	at com.google.gson.JsonDeserializationVisitor.visitChild(JsonDeserializationVisitor.java:87)
	at com.google.gson.JsonDeserializationVisitor.visitChildAsObject(JsonDeserializationVisitor.java:75)
	at com.google.gson.JsonObjectDeserializationVisitor.visitObjectField(JsonObjectDeserializationVisitor.java:62)
	at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
	at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
	at com.google.gson.JsonDeserializationContextDefault.fromJsonObject(JsonDeserializationContextDefault.java:73)
	at com.google.gson.JsonDeserializationContextDefault.deserialize(JsonDeserializationContextDefault.java:49)
	at com.google.gson.Gson.fromJson(Gson.java:379)
	at com.google.gson.Gson.fromJson(Gson.java:329)
	at com.google.gson.Gson.fromJson(Gson.java:305)
	at org.apache.mahout.clustering.JsonModelDistributionAdapter.deserialize(JsonModelDistributionAdapter.java:69)
	at org.apache.mahout.clustering.JsonModelDistributionAdapter.deserialize(JsonModelDistributionAdapter.java:37)
	at com.google.gson.JsonDeserializerExceptionWrapper.deserialize(JsonDeserializerExceptionWrapper.java:50)
	at com.google.gson.JsonDeserializationVisitor.visitUsingCustomHandler(JsonDeserializationVisitor.java:65)
	at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
	at com.google.gson.JsonDeserializationContextDefault.fromJsonObject(JsonDeserializationContextDefault.java:73)
	at com.google.gson.JsonDeserializationContextDefault.deserialize(JsonDeserializationContextDefault.java:49)
	at com.google.gson.Gson.fromJson(Gson.java:379)
	at com.google.gson.Gson.fromJson(Gson.java:329)
	at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:81)
	at org.apache.mahout.clustering.dirichlet.DirichletMapper.setup(DirichletMapper.java:55)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.StackOverflowError
	at com.google.gson.JsonParserJavacc.jj_3R_4(JsonParserJavacc.java:387)
	at com.google.gson.JsonParserJavacc.jj_3R_3(JsonParserJavacc.java:394)
	at com.google.gson.JsonParserJavacc.jj_3R_1(JsonParserJavacc.java:414)
	at com.google.gson.JsonParserJavacc.jj_3_1(JsonParserJavacc.java:400)
	at com.google.gson.JsonParserJavacc.jj_2_1(JsonParserJavacc.java:381)
	at com.google.gson.JsonParserJavacc.JsonNumber(JsonParserJavacc.java:229)
	at com.google.gson.JsonParserJavacc.JsonValue(JsonParserJavacc.java:166)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:142)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
...(repeated for ~1000 lines)
11/02/07 14:03:54 INFO mapred.JobClient:  map 0% reduce 0%
11/02/07 14:03:54 INFO mapred.JobClient: Job complete: job_local_0001
11/02/07 14:03:54 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Dirichlet
Iteration failed processing reuters-dirichlet-clusters/clusters-0
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.runIteration(DirichletDriver.java:375)
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.buildClustersMR(DirichletDriver.java:475)
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.buildClusters(DirichletDriver.java:413)
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:187)
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:137)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:83)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
