mahout-user mailing list archives

From "Divya" <di...@k2associates.com.sg>
Subject RE: Kmeans Clustering error with XML input
Date Thu, 04 Nov 2010 01:37:15 GMT
Hi,

My XML input file is just 64 MB; I am using one of the chunks from the
Wikipedia example.
Do I still need to break this XML up to get rid of the error below?


Thanks in advance  
Regards,
Divya 

-----Original Message-----
From: Matt Spitz [mailto:mspitz@meebo-inc.com] 
Sent: Wednesday, November 03, 2010 8:54 PM
To: user@mahout.apache.org
Cc: dev@mahout.apache.org
Subject: Re: Kmeans Clustering error with XML input

Divya-

Are you using just one input file?  As far as I understand, seqdirectory
creates one document per file in your input directory.  When you try to
cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
when generating the random input clusters.  Which is just as well, because
your output won't be very interesting, anyway.
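
To make the failure mode concrete, here is a simplified illustration of
picking k initial centroids from a list of vectors. This is not Mahout's
RandomSeedGenerator code; pick_seeds and its arguments are hypothetical names
used only for this sketch. The point is that k seeds cannot be drawn from
fewer than k input vectors, so a guard with a clear message is more helpful
than an index error.

```python
import random


def pick_seeds(vectors, k, rng=None):
    """Pick k distinct vectors as initial centroids (illustrative only)."""
    if k > len(vectors):
        # With one input document and k=10 we end up here.
        raise ValueError(
            f"cannot pick {k} seeds from {len(vectors)} vector(s)"
        )
    rng = rng or random.Random(0)
    # sample() draws k distinct items without replacement.
    return rng.sample(vectors, k)
```

Calling pick_seeds with a single vector and k=10 raises the ValueError, which
mirrors the situation described above.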

Break the XML into at least 10 documents, and you should have better luck.
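
One possible way to do the split is sketched below. It assumes the chunk
follows the MediaWiki export layout, where each article is a <page> element,
and writes every page to its own file. The function name split_pages and the
page-NNNNNN.xml naming scheme are made up for this example, and whether one
file maps to one document in seqdirectory is as described above, not
something the script checks.

```python
import os
import xml.etree.ElementTree as ET


def split_pages(xml_path, out_dir, limit=None):
    """Write each <page> element of a MediaWiki dump to its own file.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so the whole dump is never held in memory.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        # Dumps are usually namespaced, so compare only the local tag name.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        out_path = os.path.join(out_dir, f"page-{count:06d}.xml")
        ET.ElementTree(elem).write(out_path, encoding="utf-8")
        count += 1
        elem.clear()  # release the children of the page just written
        if limit is not None and count >= limit:
            break
    return count
```

After running it, the -i argument of seqdirectory would point at the output
directory instead of the single XML file.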

-Matt

On Wed, Nov 3, 2010 at 5:44 AM, Divya <divya@k2associates.com.sg> wrote:

> Hi,
>
> These are the steps I am following for K-Means clustering. I am using
> one chunk of the Wikipedia dump as input.
>
> Convert XML into sequence format
>
> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>
> Convert Sequence format to Vector format
>
> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>
> Cluster data
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>
> Whenever I try to run K-Means clustering with an XML file as input,
> I get the following error:
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Am I not supposed to use an XML file as input?
>
> Regards,
> Divya
>

