mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering performance
Date Fri, 03 Dec 2010 06:24:18 GMT
When you run the job, how many maps are there?

2010/12/2 Jure Jeseničnik <Jure.Jesenicnik@planet9.si>

> How can I tell whether the file is splittable? If it is not, how do I make it
> splittable?
>
> Regards.
>
> Jure
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Thursday, December 02, 2010 4:49 PM
> To: user@mahout.apache.org
> Subject: Re: Clustering performance
>
> How many maps does Hadoop schedule?  If the number is small, then you need
> to decrease the split size and make sure that your input file is
> splittable.
>
> 2010/12/2 Jure Jeseničnik <Jure.Jesenicnik@planet9.si>
>
> > I have already explained my mission here:
> >
> > http://mail-archives.apache.org/mod_mbox/mahout-user/201011.mbox/%3C0EDE11E319B0B043B4F24E0305CABF7C80413134A4@P9MAIL.p9.internal%3E
> >
> > By trial and error I have managed to find the most appropriate input
> > parameters for canopy: T1=1.4 and T2=1.2. This gives me around 7000
> > clusters from 7800 input documents, which is exactly the result I was
> > looking for. I am trying to group news items from different sources that
> > cover the same story.
> >
> > What bothers me now is the performance. To process a 3.6 MB file on my
> > fairly decent 4-core desktop machine, Mahout needs a good 14 minutes. I know
> > I am dealing with a pretty large number of clusters, but still, 14 minutes
> > is a huge amount of time. If I use a smaller amount of data (1700 docs), it
> > is all over in less than a minute.
> >
> > When running locally, Mahout was only using one CPU core. I am running it
> > on Windows 7 through Cygwin, but it behaved much the same on some proper
> > Linux machines. How can I make it use all the available CPU power?
> >
> > I also tried running this on a Hadoop cluster, but there was no
> > significant improvement in time. It seemed that Hadoop was unable to
> > properly distribute such a small task.
> >
> > Is it possible that I missed something here? What can I do to get this
> > clustering finished in a more reasonable time?
> >
> >
> >
> > Thank you for your answers.
> >
> >
> >
> > Jure
> >
> >
> > *Planet 9 d.o.o.*
> > Vojkova 78
> > 1000 Ljubljana
> > Slovenija
> > -
> > *Jure Jeseničnik*
> > Razvijalec aplikacij / Applications developer
> > jure.jesenicnik@planet9.si <jure.jesenicnikk@planet9.si>
> > *T* + 386 47 30 375
> > *F* + 386 1 47 28 550
> > *M* + 386 41 363 586
> >
> >
> >
>
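
The split-size advice quoted above can be made concrete with a little arithmetic. The following is only a rough sketch of how a Hadoop FileInputFormat-style input format decides how many splits (and therefore map tasks) a file gets; it is not Mahout or Hadoop code. The function names are made up for illustration, the 64 MB block size is an assumed default, and the 10% slack on the last split is how I understand the stock implementation to behave. In Hadoop releases of that era the upper bound is, as far as I know, controlled by the mapred.max.split.size property; treat that name as an assumption and check it against your version.

```python
# Illustrative model of FileInputFormat-style split counting.
# All names here are hypothetical; this is not the Hadoop API.

SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size, splittable=True):
    """Return how many input splits (map tasks) a single file would produce."""
    if file_size == 0:
        return 0
    if not splittable:
        # A non-splittable file (e.g. gzip-compressed text) always goes to one map.
        return 1
    remaining = file_size
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    if remaining > 0:
        splits += 1  # the tail becomes the final split
    return splits


if __name__ == "__main__":
    MB = 1024 * 1024
    file_size = int(3.6 * MB)   # the 3.6 MB input mentioned in the thread
    block_size = 64 * MB        # assumed default block size

    default_split = compute_split_size(block_size, 1, 2**63 - 1)
    small_split = compute_split_size(block_size, 1, 512 * 1024)

    print("default settings:", count_splits(file_size, default_split))
    print("512 KB max split:", count_splits(file_size, small_split))
```

Under these assumptions the 3.6 MB file yields a single split with default settings, which would explain why only one core (or one cluster node) is busy, and 8 splits once the maximum split size is lowered to 512 KB. This only helps if the input is splittable: SequenceFiles, which Mahout's clustering jobs normally read, are splittable, whereas gzip-compressed text is not.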
