mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering performance
Date Sun, 05 Dec 2010 12:58:45 GMT
Hi Jure,

I've tried a file that is 80MB, and that still isn't really enough to drive it without changing
the settings.  Mahout/Hadoop is really geared for much bigger problems.  Someday, it would
be great to have algorithms that are fast no matter the size of the input, but for now you
may want to either use a non-Hadoop version or add more data.

-Grant

On Dec 3, 2010, at 9:39 AM, Jeff Eastman wrote:

> If you are using sequence files of VectorWritable they should be splittable. 4 MB is not
> a big file for Hadoop. I'm not surprised you have only 1 mapper. You could break your vectors
> up into multiple smaller files or decrease the split size to get more splits and hence mappers.
> Have you tried the sequential version (-xm sequential)?
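
Jeff's point about split size can be made concrete with a rough sketch of how a file-based input format might turn a file size into a number of splits (and hence mappers). This is only an illustration, assuming the `max(minSize, min(maxSize, blockSize))` rule and the 10% slop on the last split used by Hadoop's newer `FileInputFormat`; the 64 MB block size and the function names here are assumptions, not anything taken from the thread or from Mahout itself.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than the rest


def compute_split_size(block_size, min_size, max_size):
    """Split size rule assumed here: clamp the block size between min and max."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Count how many splits (roughly, map tasks) a single file would yield."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than ~1.1 splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left over becomes one final split.
    if remaining > 0:
        splits += 1
    return splits


# Hypothetical settings: 64 MB blocks, no effective minimum or maximum.
default_split = compute_split_size(64 * MB, 1, 2**63 - 1)
# Same blocks, but with the maximum split size lowered to 1 MB.
small_split = compute_split_size(64 * MB, 1, 1 * MB)

splits_4mb_default = count_splits(4 * MB, default_split)
splits_4mb_small = count_splits(4 * MB, small_split)
splits_80mb_default = count_splits(80 * MB, default_split)
```

Under these assumptions a 4 MB file yields a single split with default settings and four splits once the maximum split size is lowered to 1 MB, while an 80 MB file still yields only two splits by default.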
> 
> -----Original Message-----
> From: Jure Jeseničnik [mailto:Jure.Jesenicnik@planet9.si] 
> Sent: Thursday, December 02, 2010 10:34 PM
> To: user@mahout.apache.org
> Subject: RE: Clustering performance
> 
> I think there is only one mapper. I'm gonna have to check this with my system people. I'm
> pretty sure we're gonna need to decrease the split size, since the input file is only 4 MB
> big.
> 
> The input file is a sequence file generated by the lucene script. I posted an example of it
> in my previous post.
> 
> Best regards.
> 
> Jure
> 
> 
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com] 
> Sent: Friday, December 03, 2010 7:24 AM
> To: user@mahout.apache.org
> Subject: Re: Clustering performance
> 
> When you run the job, how many maps are there?
> 
> 2010/12/2 Jure Jeseničnik <Jure.Jesenicnik@planet9.si>
> 
>> How can I see if the file is splittable or not? If not, how do I make it
>> splittable?
>> 
>> Regards.
>> 
>> Jure
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Thursday, December 02, 2010 4:49 PM
>> To: user@mahout.apache.org
>> Subject: Re: Clustering performance
>> 
>> How many maps does Hadoop schedule?  If the number is small, then you need
>> to decrease the split size and make sure that your input file is
>> splittable.
>> 
>> 2010/12/2 Jure Jeseničnik <Jure.Jesenicnik@planet9.si>
>> 
>>> I have already explained my mission here:
>>> 
>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201011.mbox/%3C0EDE11E319B0B043B4F24E0305CABF7C80413134A4@P9MAIL.p9.internal%3E
>>> 
>>> Using trial and error I’ve managed to find the most appropriate
>>> input parameters for canopy: T1=1.4, T2=1.2. This gives me
>>> somewhere around 7000 clusters from 7800 input documents, which is exactly
>>> the result I’ve been looking for. I’m trying to group together news items from
>>> different sources that talk about the same story.
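
For readers unfamiliar with the T1 and T2 thresholds Jure mentions, here is a minimal sketch of the classic single-pass canopy procedure. It is not Mahout's implementation; the Euclidean distance, the function names, and the toy points are assumptions chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; the measure used in the thread is not stated."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_cluster(points, t1, t2, distance=euclidean):
    """Classic canopy clustering sketch with loose threshold t1 > tight threshold t2.

    Returns a list of (center, members) pairs. A point within t1 of a center
    joins that canopy; a point within t2 is also removed from the candidate
    list, so it can never become a center itself.
    """
    if t1 <= t2:
        raise ValueError("T1 must be larger than T2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely bound: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Toy 2-D points (hypothetical), using the thresholds from the message.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (1.3, 0.0)]
toy_canopies = canopy_cluster(toy_points, t1=1.4, t2=1.2)
```

Because T2 is close to T1 here, few points are removed from the candidate list on each pass, which is one way a data set can end up with almost as many canopies as documents.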
>>> 
>>> What bothers me now is the performance. To process
>>> a 3.6 MB file on my pretty decent 4-core desktop machine, Mahout needs
>>> a good 14 minutes. I know I’m dealing with a pretty large number of clusters,
>>> but still, 14 minutes is a huge amount of time. If I use a smaller
>>> amount of data (1700 docs) it is all over in less than a minute.
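
The two timings quoted above are roughly what quadratic growth would predict. The back-of-the-envelope sketch below assumes that the work is bounded by (documents × canopies) distance computations and that the smaller run produces canopies in the same proportion as the larger one (about 7000 per 7800 documents); neither assumption is confirmed in the thread.

```python
def distance_computations(num_docs, canopies_per_doc):
    """Upper bound on comparisons if every document is checked against every canopy."""
    num_canopies = num_docs * canopies_per_doc
    return num_docs * num_canopies


# Figures from the message: ~7000 canopies from 7800 documents in about 14 minutes.
canopies_per_doc = 7000 / 7800  # assumed to hold for the smaller run as well

large_run = distance_computations(7800, canopies_per_doc)
small_run = distance_computations(1700, canopies_per_doc)

# Under these assumptions the ratio reduces to (7800 / 1700) ** 2.
work_ratio = large_run / small_run

# Scale the measured 14 minutes down to the smaller input.
predicted_small_minutes = 14 / work_ratio
```

Under these assumptions the larger run does about 21 times the work, which would put the 1700-document run at roughly 40 seconds, in line with the reported "less than a minute".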
>>> 
>>> When running locally, Mahout was only using one CPU core. I’m running
>>> it on Windows 7 through Cygwin, but it behaved pretty much the same on some proper
>>> Linux machines. How could I make it use all the available CPU power?
>>> 
>>> I also tried running this on a Hadoop cluster, but there seemed to be no
>>> significant improvement in time. It seemed like Hadoop was unable to
>>> properly distribute such a small task.
>>> 
>>> Is it possible that I missed something here? What can I do to have this
>>> clustering finish in a more reasonable time?
>>> 
>>> 
>>> 
>>> Thank you for your answers.
>>> 
>>> 
>>> 
>>> Jure
>>> 
>>> *Planet 9 d.o.o.*
>>> Vojkova 78
>>> 1000 Ljubljana
>>> Slovenija
>>> -
>>> *Jure Jeseničnik*
>>> Razvijalec aplikacij / Applications developer
>>> jure.jesenicnik@planet9.si <jure.jesenicnikk@planet9.si>
>>> *T* + 386 47 30 375
>>> *F* + 386 1 47 28 550
>>> *M* + 386 41 363 586
>>> 
>>> 
>>> 
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

