Subject: Re: Clustering performance
From: Grant Ingersoll
Date: Sun, 5 Dec 2010 07:58:45 -0500
Message-Id: <3074CCC1-0D8E-4E7E-A4EC-2EA8E547E2A4@apache.org>
In-Reply-To: <99CF5A2B2A1D9542A589C5F5EBD3DA03038301EA0D@rock.narus.com>
References: <0EDE11E319B0B043B4F24E0305CABF7C80415AAD6A@P9MAIL.p9.internal> <0EDE11E319B0B043B4F24E0305CABF7C80415AADC1@P9MAIL.p9.internal> <0EDE11E319B0B043B4F24E0305CABF7C80415AADC8@P9MAIL.p9.internal>
 <99CF5A2B2A1D9542A589C5F5EBD3DA03038301EA0D@rock.narus.com>
To: user@mahout.apache.org

Hi Jure,

I've tried a file that is 80MB, and that still isn't really enough to
drive it without changing the settings. Mahout/Hadoop is really geared
for much bigger problems. Someday, it would be great to have algorithms
that are fast no matter the size of the input, but for now you may want
to either use a non-Hadoop version or add more data.

-Grant

On Dec 3, 2010, at 9:39 AM, Jeff Eastman wrote:

> If you are using sequence files of VectorWritable they should be
> splittable. 4MB is not a big file for Hadoop. I'm not surprised you have
> only 1 mapper. You could break your vectors up into multiple smaller
> files or decrease the split size to get more splits and hence mappers.
> Have you tried the sequential version (-xm sequential)?
>
> -----Original Message-----
> From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si]
> Sent: Thursday, December 02, 2010 10:34 PM
> To: user@mahout.apache.org
> Subject: RE: Clustering performance
>
> I think there is only one mapper. I'm gonna have to check this with my
> system people. I'm pretty sure we're gonna need to decrease the split
> size, since the input file is only 4MB.
>
> The input file is a sequence file generated by the lucene script. I
> posted an example of it in my previous post.
>
> Best regards.
>
> Jure
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Friday, December 03, 2010 7:24 AM
> To: user@mahout.apache.org
> Subject: Re: Clustering performance
>
> When you run the job, how many maps are there?
>
> 2010/12/2 Jure Jeseničnik
>
>> How can I see if the file is splittable or not? If not, how do I make it
>> splittable?
>>
>> Regards.
>>
>> Jure
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Thursday, December 02, 2010 4:49 PM
>> To: user@mahout.apache.org
>> Subject: Re: Clustering performance
>>
>> How many maps does Hadoop schedule? If the number is small, then you need
>> to decrease the split size and make sure that your input file is
>> splittable.
>>
>> 2010/12/2 Jure Jeseničnik
>>
>>> I have already explained my mission here:
>>>
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201011.mbox/%3C0EDE11E319B0B043B4F24E0305CABF7C80413134A4@P9MAIL.p9.internal%3E
>>>
>>> Using trial and error I've managed to find the most appropriate
>>> input parameters for canopy: T1=1.4, T2=1.2. This gives me
>>> somewhere around 7000 clusters from 7800 input documents, which is
>>> exactly the result I've been looking for. I'm trying to group the
>>> news from different sources that talk about the same story.
>>>
>>> What bothers me now is the performance. To process a 3.6 MB file
>>> on my pretty decent 4-core desktop machine, Mahout needs a good
>>> 14 minutes. I know I'm dealing with a pretty large number of
>>> clusters, but still, 14 minutes is a huge amount of time. If I use a
>>> smaller amount of data (1700 docs) it is all over in less than a minute.
>>>
>>> When running locally, Mahout was only consuming one CPU core. I'm
>>> running it on Win 7 through Cygwin, but it behaved pretty much the
>>> same on some proper Linux machines. How could I make it use all the
>>> available CPU power?
>>>
>>> I also tried running this on a Hadoop cluster, but there seemed to be no
>>> significant improvement in time. It seemed like Hadoop was unable to
>>> properly distribute such a small task.
>>>
>>> Is it possible that I missed something here?
>>> What can I do to have this clustering finished in a more
>>> reasonable time?
>>>
>>> Thank you for your answers.
>>>
>>> Jure
>>>
>>> Jure Jeseničnik
>>> Applications developer
>>> Planet 9 d.o.o.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
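
A rough illustration of the split-size advice in the thread: the sketch below estimates how many input splits (and hence map tasks) a single splittable file yields. It assumes the rule used by Hadoop's FileInputFormat in the newer MapReduce API, where the split size is max(minSize, min(maxSize, blockSize)) and the last split may run up to 10% over. The 64MB block size and the function names are assumptions made for illustration, and real SequenceFile splits also depend on sync markers, so treat the numbers as estimates only.

```python
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than the split size
MB = 1024 * 1024


def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size under the assumed rule: max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Estimate the number of input splits for one splittable file."""
    size = split_size(block_size, min_size, max_size)
    remaining = file_size
    splits = 0
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / size > SPLIT_SLOP:
        splits += 1
        remaining -= size
    # Whatever is left over becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # 4MB file with an assumed 64MB block size and default limits.
    print(count_splits(4 * MB, 64 * MB))
    # Same file with the maximum split size capped at 1MB.
    print(count_splits(4 * MB, 64 * MB, max_size=1 * MB))
```

Under these assumptions a 4MB file gets a single split unless the maximum split size is lowered below the file size, which matches the single mapper reported above. The setting that controls the maximum split size is named differently across Hadoop versions, so check the documentation for the version in use.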