From: Matt Spitz
Date: Mon, 8 Nov 2010 08:52:15 -0500
Subject: Re: RE: Kmeans Clustering error with XML input
To: Divya
Cc: user@mahout.apache.org
Reply-To: user@mahout.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0015174c435cbe52c804948aef39

--0015174c435cbe52c804948aef39
Content-Type: text/plain; charset=ISO-8859-1

Divya,

'seqdirectory' creates a document for every file in the directory you pass
in. If there's just one file, there's just one document, and that's not
very interesting. You basically have two options:

1) Parse the XML file once and break it into 1000s of little files (one
   per document, however you define it).
2) Write a new 'seqdirectory' that creates a sequence file based on parsed
   XML input. This actually isn't too difficult, as the seqdirectory code
   is pretty straightforward (thanks to whoever did that!).

-Matt

On Mon, Nov 8, 2010 at 1:39 AM, Divya wrote:

> Hi Matt,
>
> I have an XML input file like the Wikipedia XML, and I am trying to find
> similar documents using k-means clustering.
>
> But if I pass the whole XML file (size 64 MB) as input during k-means
> clustering, I get an error.
>
> According to your short answer, if I have 1000s of documents in an XML
> file, I should split my XML file into 1000s of chunks.
>
> Is there any other way I can get similar documents?
>
> Regards,
> Divya
>
> *From:* Matt Spitz [mailto:mspitz@meebo-inc.com]
> *Sent:* Thursday, November 04, 2010 8:46 PM
> *To:* Divya
> *Cc:* user@mahout.apache.org
> *Subject:* Re: RE: Kmeans Clustering error with XML input
>
> Divya-
>
> A document is what the clustering algorithm operates on. It finds
> similarities among the documents and places similar documents into
> clusters. The 'seqdirectory' command expects you to have a single
> document in every file in the input directory. What do you expect to
> happen with your Wikipedia clustering? What are you trying to do?
>
> Short answer: yes, split the XML file by the tags, putting each element
> in its own separate file.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 10:26 PM, Divya wrote:
>
> Hi Matt,
> I have split my file into 10 chunks of 10 MB each.
> I am still getting the error.
> Do you mean that I should split the XML file in (in the Wikipedia
> example)?
>
> I didn't understand what "one file = one document" means.
>
> Regards,
> Divya
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors \
>     -o D:/MahoutResult/wikipedia/Kmeans \
>     -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>     -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
> 10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> -----Original Message-----
> From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> Sent: Thursday, November 04, 2010 9:44 AM
> To: user@mahout.apache.org
> Cc: dev@mahout.apache.org
> Subject: Re: RE: Kmeans Clustering error with XML input
>
> Yes. One file = one document.
>
> Break the file into meaningful documents, one per file, and you should be
> golden. The algorithm will then cluster these documents.
>
> ---
> Sent while mobile. Please forgive brevity and typos.
>
> On Nov 3, 2010 9:37 PM, "Divya" wrote:
>
> > Hi,
> >
> > My XML input file is just 64 MB, i.e. I am using one of the chunks of
> > the Wikipedia example.
> > Do I still need to break up this XML to get rid of the error below?
> >
> > Thanks in advance
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:mspitz@meebo-inc.com]
> > Sent: Wednesday, November 03, 2010 8:54 PM
> > To: user@mahout.apache.org
> > Cc: dev@mahout.apache.org
> > Subject: Re: Kmeans Clustering error with XML input
> >
> > Divya-
> >
> > Are you using just one input file? As far as I understand, seqdirectory
> > creates one document per file in your input directory. When you try to
> > cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> > when generating the random input clusters. Which is just as well, because
> > your output won't be very interesting, anyway.
> >
> > Break the XML into at least 10 documents, and you should have better
> > luck.
> >
> > -Matt
> >
> > On Wed, Nov 3, 2010 at 5:44 AM, Divya wrote:
> >
> >> Hi,
> >>
> >> Steps I am following for k-means clustering:
> >>
> >> I am using one of the chunks of Wikipedia as input.
> >>
> >> Convert XML into sequence format
> >>
> >> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input \
> >>     -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
> >>
> >> Convert sequence format to vector format
> >>
> >> $ bin/mahout seqdirectory \
> >>     -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml \
> >>     -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
> >>
> >> Cluster data
> >>
> >> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors \
> >>     -o D:/MahoutResult/wikipedia/kmeans \
> >>     -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
> >>
> >> Whenever I try to run k-means clustering with an XML file as input, I
> >> get the following error:
> >>
> >> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> >> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> >> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> >> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> >> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> >> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> >>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> >>     at java.util.ArrayList.get(ArrayList.java:322)
> >>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>
> >> Am I not supposed to use an XML file as input?
> >>
> >> Regards,
> >>
> >> Divya

--0015174c435cbe52c804948aef39--
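
Option 1 from the reply at the top of this thread (break the XML into one small file per document, then point 'seqdirectory' at the resulting directory) could be sketched roughly as below. This is only an illustrative Python sketch, not Mahout code. The names split_xml and local_name, the output file naming, and the default tag names "page" and "text" are all assumptions made for the example; the tag names in the original messages were lost, so adjust them to whatever your dump actually uses.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_xml(xml_path, out_dir, doc_tag="page", text_tag="text"):
    """Write the text of every <doc_tag> element to its own file.

    doc_tag and text_tag are assumed names (hypothetical defaults).
    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so the whole dump is not parsed up front.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != doc_tag:
            continue
        # Gather the text of every <text_tag> descendant of this document.
        parts = [
            child.text or ""
            for child in elem.iter()
            if local_name(child.tag) == text_tag
        ]
        out_path = os.path.join(out_dir, "doc-%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n".join(parts))
        count += 1
        elem.clear()  # release the subtree that was just written out
    return count
```

The returned count is worth checking before clustering: as explained earlier in the thread, asking k-means for more clusters than there are documents (for example -k 10 with a single document) is what triggers the IndexOutOfBoundsException, so the output directory should hold at least as many files as the -k value.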