mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From zaki rahaman <zaki.raha...@gmail.com>
Subject Re: Clustering from DB
Date Wed, 15 Jul 2009 21:25:18 GMT
I hope I'm understanding your setup correctly but by running on one machine,
you're not fully exploiting the capabilities of Hadoop's Map/Reduce. Gains
in computation time will only be seen by increasing the number of cores or
nodes. If you need access to more computing power, you might want to
consider using Amazon's EC2 (they have preconfigured AMIs for Hadoop but
youd have to configure and install Mahout, a process which I'm not totally
familiar with as of yet as I'm still trying to do it myself).

On Wed, Jul 15, 2009 at 4:24 PM, nfantone <nfantone@gmail.com> wrote:

> Well, I grew tired of watching the whole thing run and stopped it. I,
> then, started another test, this time around using a smaller dataset
> of 3Gb and it is still taking way too long.
> See inline comments.
>
> > You are only specifying a single reducer. Try increasing that as below.
>
> I did. I set it to my K value (200).
>
> > No, number of nodes is the number of nodes (computers) in your cluster.
> You
> > did not say how many nodes you are running on.
>
> I'm running and compiling the application on one simple desktop
> computer at work, and that isn't likely to change after the
> development process is finished.
>
> > Yes, Hadoop allocates this automatically. How many map tasks are being
> > spawned?
>
> Being uncertain of where (and when) Hadoop computes the adequate
> number of map tasks, what I did was inspect the following 'numMaps'
> variable while debugging:
>
>   if (name.startsWith("part") && !name.endsWith(".crc")) {
>         SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> part.getPath(), conf);
>        int numMaps = conf.getNumMapTasks();
>  ... }
>
> which is at the beginning of the isConverged() method. Its current
> value is, in every iteration, 2. I suspect this isn't right at all,
> whether because this is not the proper place to ask for the number of
> maps or it's not being setted the way it should.
> From the Hadoop Javadoc:
>
> "The number of maps is usually driven by the total size of the inputs
> i.e. total number of blocks of the input files."
>
> In my input file each block represents a vector that corresponds to
> some computed user behavior. The number of users to be clustered (i.e
> the number of blocks) is a parameter expected by the application.
> Perhaps, I should -somehow- change the block size of my HDFS file? Or
> tweak something in the Configuration/FileSystem instance i'm using to
> write it?
>
> >> For now, the clustering is STILL running in the background, ha.
> >>
> >> On Wed, Jul 15, 2009 at 12:30 PM, Jeff
> >> Eastman<jdog@windwardsolutions.com> wrote:
> >>
> >>>
> >>> Glad to hear KMeans is working reliably now. Your performance problems
> >>> will
> >>> require some additional tuning. Here are some suggestions:
> >>> - You did not mention how many mappers are running in your job. With
> 60gb
> >>> in
> >>> a single input file, I would think Hadoop would allocate multiple
> mapper
> >>> tasks automatically, since there are thousands of potential splits. If
> >>> this
> >>> is not happening (is the file compressed?), then breaking it into
> >>> multiple
> >>> parts in a preprocessing step would allow you to get more concurrency
> in
> >>> the
> >>> map phase.
> >>> - Same with the reducers; how many are you running and what is your K?
> >>> The
> >>> default number of reducers is 2, but you can increase this up to the
> >>> number
> >>> of clusters to increase parallelism. Unlike Canopy and Mean Shift,
> KMeans
> >>> can use multiple reducers up to that limit.
> >>> - Finally, what is the size of your cluster? Adding machines would be
> >>> another way to increase concurrency, since map and reduce tasks are
> >>> spread
> >>> across the entire cluster.
> >>>
> >>> 60 gb is a small dataset for Hadoop. I don't think it should be taking
> >>> that
> >>> long.
> >>> Jeff
> >>>
> >>> nfantone wrote:
> >>>
> >>>>
> >>>> After updating to the latest revision, everything seems to be working
> >>>> just fine. However, the task I set up to do, user clustering by
> >>>> KMeans, is taking forever to complete: I initiated the job yesterday's
> >>>> morning and it's still running today (an elapsed time of nearly 18hs
> >>>> and counting...). Of course, the main reason behind it it's the huge
> >>>> size of the data set I'm trying to process (a ~60Gb HDFS file), but
> >>>> I'm looking for ways to improve the performance. Would splitting the
> >>>> input file into smaller parts do any difference? Is it even possible
> >>>> to set the Driver in order to use more than one input (right now, I'm
> >>>> specifying a full path to a single file, including its filename)? What
> >>>> about setting a higher number of reducers? Is there any drawbacks to
> >>>> that? Running multiple KMeans' job in several threads?
> >>>>
> >>>> Or perhaps, I'm just doing something wrong and should not be taking
> >>>> this long. Surely, I'm not the first one to encounter this running
> >>>> time issue with large datasets. Ideas, anyone?
> >>>>
> >>>>
> >>>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<nfantone@gmail.com>
wrote:
> >>>>
> >>>>
> >>>>>
> >>>>> Great work. It works like a charm now. Thank you very much.
> >>>>>
> >>>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff
> >>>>> Eastman<jdog@windwardsolutions.com>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate
over
> >>>>>> all
> >>>>>> cluster part files. Unit test now runs without error and the
> synthetic
> >>>>>> control job completes too.
> >>>>>>
> >>>>>>
> >>>>>> Jeff Eastman wrote:
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> In this case, the code should be reading all of the clusters
into
> >>>>>>> memory
> >>>>>>> to see if they have all converged. These may be split into
multiple
> >>>>>>> part
> >>>>>>> files if more than one reducer is specified. So /* is the
correct
> >>>>>>> file
> >>>>>>> pattern and it is the calling site that should remove the
> /part-0000
> >>>>>>> reference. The code in isConverged should loop through all
the
> parts,
> >>>>>>> returning if they have all converged or not.
> >>>>>>>
> >>>>>>> I'll take a detailed look tomorrow.
> >>>>>>>
> >>>>>>>
> >>>>>>> Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hmm, that might be a mistake on my part when trying
to resolve how
> >>>>>>>> Hadoop
> >>>>>>>> 0.20 now resolves globs.  I somewhat blindly applied
"/*" where
> >>>>>>>> needed, but
> >>>>>>>> I think it is likely worth revistiing here where a specific
file
> is
> >>>>>>>> needed?
> >>>>>>>>
> >>>>>>>> -Grant
> >>>>>>>>
> >>>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This error is still bugging me. The exception:
> >>>>>>>>>
> >>>>>>>>> WARNING: java.io.FileNotFoundException: File
> >>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>> java.io.FileNotFoundException: File
> output/clusters-0/part-00000/*
> >>>>>>>>> does not exist.
> >>>>>>>>>
> >>>>>>>>> ocurrs first at:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
> >>>>>>>>>
> >>>>>>>>> which corresponds to:
> >>>>>>>>>
> >>>>>>>>>  private static boolean isConverged(String filePath,
JobConf
> conf,
> >>>>>>>>> FileSystem fs)
> >>>>>>>>>   throws IOException {
> >>>>>>>>>  Path outPart = new Path(filePath + "/*");
> >>>>>>>>>  SequenceFile.Reader reader = new SequenceFile.Reader(fs,
> outPart,
> >>>>>>>>> conf);  <-- THIS
> >>>>>>>>>  ...
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>> where isConverged() is called in this fashion:
> >>>>>>>>>
> >>>>>>>>> return isConverged(clustersOut + "/part-00000",
conf, fs);
> >>>>>>>>>
> >>>>>>>>> by runIteration(), which is previously invoked by
runJob() like:
> >>>>>>>>>
> >>>>>>>>>  String clustersOut = output + "/clusters-" + iteration;
> >>>>>>>>>   converged = runIteration(input, clustersIn, clustersOut,
> >>>>>>>>> measureClass,
> >>>>>>>>>       delta, numReduceTasks, iteration);
> >>>>>>>>>
> >>>>>>>>> Consequently, assuming its the first iteration and
the output
> >>>>>>>>> folder
> >>>>>>>>> has been named "output" by the user, the SequenceFile.Reader
> >>>>>>>>> receives
> >>>>>>>>> "output/clusters-0/part-00000/*" as a path, which
is
> non-existent.
> >>>>>>>>> I
> >>>>>>>>> believe the path should end in "part-00000" and
the  + "/*"
> should
> >>>>>>>>> be
> >>>>>>>>> removed... although someone, evidently, thought
otherwise.
> >>>>>>>>>
> >>>>>>>>> Any feedback?
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nfantone@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I was using Canopy to create input clusters,
but the error
> >>>>>>>>>> appeared
> >>>>>>>>>> while running kMeans (if I run kMeans' job only
with previously
> >>>>>>>>>> created clusters from Canopy placed in output/canopies
as
> initial
> >>>>>>>>>> clusters, it still fails). I noticed no other
problems. I was
> >>>>>>>>>> using
> >>>>>>>>>> revision 790979 before updating.  Strangely,
there were no
> changes
> >>>>>>>>>> in
> >>>>>>>>>> the job and drivers class from that revision.
svn diff shows
> that
> >>>>>>>>>> the
> >>>>>>>>>> only classes that changed in org.apache.mahout.clustering.kmeans
> >>>>>>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
> >>>>>>>>>> Eastman<jdog@windwardsolutions.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hum, no, it's looking for the output of
the first iteration.
> Were
> >>>>>>>>>>> there
> >>>>>>>>>>> other errors? What was the last revision
you were running? It
> >>>>>>>>>>> does
> >>>>>>>>>>> look like
> >>>>>>>>>>> something got horked, as it should be looking
for
> >>>>>>>>>>> output/clusters-0/*.
> >>>>>>>>>>> Can
> >>>>>>>>>>> you diff the job and driver class to see
what changed?
> >>>>>>>>>>>
> >>>>>>>>>>> Jeff
> >>>>>>>>>>>
> >>>>>>>>>>> nfantone wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Fellows, today I updated to revision
791558 and while running
> >>>>>>>>>>>> kMeans
> >>>>>>>>>>>> I
> >>>>>>>>>>>> got the following exception:
> >>>>>>>>>>>>
> >>>>>>>>>>>> WARNING: java.io.FileNotFoundException:
File
> >>>>>>>>>>>> output/clusters-0/part-00000/* does
not exist.
> >>>>>>>>>>>> java.io.FileNotFoundException: File
> >>>>>>>>>>>> output/clusters-0/part-00000/*
> >>>>>>>>>>>> does not exist.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The algorithm isn't interrupted, though.
But this exception
> >>>>>>>>>>>> wasn't
> >>>>>>>>>>>> thrown before the update and, to me,
its message is not quite
> >>>>>>>>>>>> clear.
> >>>>>>>>>>>> It seems as it's looking for any file
inside a "part-00000"
> >>>>>>>>>>>> directory,
> >>>>>>>>>>>> which doesn't exist; and, as far as
I know, "part-xxxxx" are
> >>>>>>>>>>>> default
> >>>>>>>>>>>> names for output files.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I could show the entire stack trace,
if needed. Any pointers?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nfantone@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the feedback, Jeff.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The logical format of input
to KMeans is <Key, Vector> as it
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>> sequence
> >>>>>>>>>>>>>> file format, but the Key is
never used. To my knowledge,
> there
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>> no
> >>>>>>>>>>>>>> requirement to assign identifiers
to the input points*.
> Users
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>> free
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>> associate an arbitrary name
field with each vector - also
> >>>>>>>>>>>>>> label
> >>>>>>>>>>>>>> mappings
> >>>>>>>>>>>>>> may
> >>>>>>>>>>>>>> be assigned - but these are
not manipulated by KMeans or any
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> other
> >>>>>>>>>>>>>> clustering applications. The
name field is now used as a
> >>>>>>>>>>>>>> vector
> >>>>>>>>>>>>>> identifier
> >>>>>>>>>>>>>> by the KMeansClusterMapper -
if it is non-null - in the
> output
> >>>>>>>>>>>>>> step
> >>>>>>>>>>>>>> only.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The key may not be used internally,
but externally they can
> >>>>>>>>>>>>> prove
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>> be pretty useful. For me, keys are
userIDs and each Vector
> >>>>>>>>>>>>> represents
> >>>>>>>>>>>>> his/her historical behavior. Being
able to collect the output
> >>>>>>>>>>>>> information as <UserID, ClusterID>
is quite neat as it allows
> >>>>>>>>>>>>> me
> >>>>>>>>>>>>> to,
> >>>>>>>>>>>>> for instance, retrieve user information
using data directly
> >>>>>>>>>>>>> from
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>> HDFS file's field.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com/
> >>>>>>>>
> >>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >>>>>>>> using
> >>>>>>>> Solr/Lucene:
> >>>>>>>> http://www.lucidimagination.com/search
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>



-- 
Zaki Rahaman

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message