mahout-user mailing list archives

From Matt Spitz <msp...@meebo-inc.com>
Subject Re: RE: Kmeans Clustering error with XML input
Date Thu, 04 Nov 2010 12:46:09 GMT
Divya-

A document is what the clustering algorithm operates on.  It finds
similarities among the documents and places similar documents into clusters.
 The 'seqdirectory' command expects you to have a single document in every
file in the input directory.  What do you expect to happen with your
Wikipedia clustering?  What are you trying to do?

Short answer: yes, split the XML file by the <page> tags, putting each
<page> element in its own separate file.
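
A minimal Python sketch of that split, assuming the dump is well-formed XML
and that every document is a <page> element (possibly namespaced). The names
split_pages and local_name and the output file naming are hypothetical
helpers for illustration, not part of Mahout:

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir):
    """Write each <page> element of xml_path to its own file in out_dir.

    Returns the number of files written. File names are an assumption
    made for this sketch (page-000001.xml, page-000002.xml, ...).
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so a large dump need not fit in memory.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        count += 1
        out_path = os.path.join(out_dir, "page-%06d.xml" % count)
        ET.ElementTree(elem).write(out_path, encoding="utf-8")
        elem.clear()  # drop this page's children to keep memory use low
    return count
```

Point the output directory at whatever you pass to 'seqdirectory' with -i,
and check that the returned count is at least the -k you give to kmeans
(10 in your command); with fewer documents than clusters, the random seed
step can fail the same way.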

-Matt

On Wed, Nov 3, 2010 at 10:26 PM, Divya <divya@k2associates.com.sg> wrote:

> Hi Matt,
> I have split my file into 10 chunks of 10 MB each.
> I am still getting the error.
> Do you mean that I should split the XML file by element (in the Wikipedia
> example, <page> </page>)?
>
> I didn't understand what "one file = one document" means.
>
> Regards,
> Divya
>
>
>
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors
> -o D:/MahoutResult/wikipedia/Kmeans
> -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments:
> {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans,
> --convergenceDelta=0.5,
> --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure,
> --endPhase=2147483647,
> --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20,
> --method=mapreduce, --numClusters=10,
> --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null,
> --startPhase=0, --tempDir=temp}
> 10/11/04 10:21:22 INFO common.HadoopUtil: Deleting
> D:/MahoutResult/wikipedia/Kmeans
> 10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, November 04, 2010 9:44 AM
> To: user@mahout.apache.org
> Cc: dev@mahout.apache.org
> Subject: Re: RE: Kmeans Clustering error with XML input
>
> Yes. One file = one document.
>
> Break the file into meaningful documents, one per file, and you should be
> golden.  The algorithm will then cluster these documents.
>
> ---
> Sent while mobile. Please forgive brevity and typos.
> On Nov 3, 2010 9:37 PM, "Divya" <divya@k2associates.com.sg> wrote:
> > Hi,
> >
> > My XML input file is just 64 MB, i.e. I am using one of the chunks of the
> > Wikipedia example.
> > Do I still need to break this XML up to get rid of the error below?
> >
> >
> > Thanks in advance
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Wednesday, November 03, 2010 8:54 PM
> > To: user@mahout.apache.org
> > Cc: dev@mahout.apache.org
> > Subject: Re: Kmeans Clustering error with XML input
> >
> > Divya-
> >
> > Are you using just one input file? As far as I understand, seqdirectory
> > creates one document per file in your input directory. When you try to
> > cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> > when generating the random input clusters. Which is just as well, because
> > your output won't be very interesting, anyway.
> >
> > Break the XML into at least 10 documents, and you should have better
> > luck.
> >
> > -Matt
> >
> > On Wed, Nov 3, 2010 at 5:44 AM, Divya <divya@k2associates.com.sg> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> These are the steps I am following for k-means clustering.
> >>
> >> I am using one of the chunks of Wikipedia as input.
> >>
> >>
> >>
> >> Convert XML into sequence format
> >>
> >> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o
> >> D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
> >>
> >>
> >>
> >> Convert Sequence format to Vector format
> >>
> >> $ bin/mahout seqdirectory -i
> >> D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml
> >> -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
> >>
> >>
> >>
> >> Cluster data
> >>
> >> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors
> >> -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans
> >> -k 10 -x 20 -ow -cl
> >>
> >>
> >>
> >>
> >>
> >> Whenever I try to run k-means clustering with an XML file as input,
> >> I get the following error:
> >>
> >>
> >>
> >> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> >>
> >> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> >>
> >> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments:
> >> {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans,
> >> --convergenceDelta=0.5,
> >> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> >> --endPhase=2147483647,
> >> --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20,
> >> --method=mapreduce, --numClusters=10,
> >> --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null,
> >> --startPhase=0, --tempDir=temp}
> >>
> >> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting
> >> D:/MahoutResult/wikipedia/kmeans
> >>
> >> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop
> >> library for your platform... using builtin-java classes where applicable
> >>
> >> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> >>
> >> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> >>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> >>        at java.util.ArrayList.get(ArrayList.java:322)
> >>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
> >>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Am I not supposed to use an XML file as input?
> >>
> >>
> >>
> >>
> >>
> >> Regards,
> >>
> >> Divya
> >>
> >>
> >
>
>
