mahout-user mailing list archives

From "Divya" <di...@k2associates.com.sg>
Subject Kmeans Clustering error with XML input
Date Wed, 03 Nov 2010 09:44:34 GMT
Hi,

 

These are the steps I am following for K-Means clustering. I am using one chunk of Wikipedia as the input.

 

Convert XML into sequence format 

$ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8

 

Convert Sequence format to Vector format 

$ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8

 

Cluster data 

$ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl

 

 

Whenever I run K-Means clustering with an XML file as the input, I get the following error:

 

Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
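
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: "Index: 1, Size: 1" means some list held a single element when the code asked for a second one. That would fit a seed-selection step that expects at least k input vectors (here -k 10) but finds only one, for example if the whole XML file ended up as a single document. The sketch below is not Mahout's RandomSeedGenerator; the class SeedSketch and the method pickSeeds are hypothetical names used only to reproduce the same kind of exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Mahout's RandomSeedGenerator.
class SeedSketch {

    // Collects up to k input vectors as candidate seeds, then reads back
    // exactly k entries. The read-back loop assumes the input held at
    // least k vectors, which is the assumption being illustrated.
    static List<String> pickSeeds(List<String> vectors, int k) {
        List<String> chosen = new ArrayList<>();
        for (String v : vectors) {
            if (chosen.size() < k) {
                chosen.add(v);
            }
        }
        List<String> seeds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            // Throws IndexOutOfBoundsException when chosen.size() < k.
            seeds.add(chosen.get(i));
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Three vectors, two clusters: enough input, so this succeeds.
        System.out.println(pickSeeds(Arrays.asList("a", "b", "c"), 2));

        // One vector, ten clusters: the read at index 1 is out of range.
        try {
            pickSeeds(Arrays.asList("only-doc"), 10);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

If this reading is right, the thing to check would be how many vectors the tfidf-vectors directory actually contains compared with the requested number of clusters.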

 

 

 

Am I not supposed to use an XML file as the input?

 

 

Regards,

Divya 

