From: nfantone
To: mahout-user@lucene.apache.org
Reply-To: mahout-user@lucene.apache.org
Subject: Re: Clustering from DB
Date: Wed, 15 Jul 2009 09:49:22 -0300
Message-ID: <37ffc8080907150549t7bd4a865j986b87c8292b84e9@mail.gmail.com>
In-Reply-To: <37ffc8080907131039u56441a68m32b9cb0d7f5a8ac5@mail.gmail.com>
References: <37ffc8080906260720w485c1babq9b0b765c07e9e0ac@mail.gmail.com>
 <37ffc8080907021116p2cdf679do38c5760151275db6@mail.gmail.com>
 <37ffc8080907061131l75ff0958x1f3a6c65c26878e@mail.gmail.com>
 <4A524889.2060102@windwardsolutions.com>
 <37ffc8080907061339v497fe562g484ebb577e992745@mail.gmail.com>
 <37ffc8080907101208t484a84aet710f64ad0b28127e@mail.gmail.com>
 <05FF9EDF-470F-41C4-8382-90DCA56F33CC@apache.org>
 <4A5A6902.7040200@windwardsolutions.com>
 <4A5B63B1.9060509@windwardsolutions.com>
 <37ffc8080907131039u56441a68m32b9cb0d7f5a8ac5@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

After updating to the latest revision, everything seems to be working
just fine. However, the task I set out to do, user clustering with
KMeans, is taking forever to complete: I started the job yesterday
morning and it is still running today (nearly 18 hours elapsed and
counting...). Of course, the main reason is the huge size of the data
set I'm trying to process (a ~60 GB HDFS file), but I'm looking for
ways to improve performance.

Would splitting the input file into smaller parts make any difference?
Is it even possible to set up the Driver to use more than one input
(right now, I'm specifying a full path to a single file, including its
filename)?
What about setting a higher number of reducers? Are there any
drawbacks to that? What about running multiple KMeans jobs in several
threads? Or perhaps I'm just doing something wrong and it should not
be taking this long. Surely I'm not the first one to run into this
running-time issue with large datasets. Ideas, anyone?

On Mon, Jul 13, 2009 at 2:39 PM, nfantone wrote:
> Great work. It works like a charm now. Thank you very much.
>
> On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman wrote:
>> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
>> cluster part files. Unit test now runs without error and the synthetic
>> control job completes too.
>>
>>
>> Jeff Eastman wrote:
>>>
>>> In this case, the code should be reading all of the clusters into memory
>>> to see if they have all converged. These may be split into multiple part
>>> files if more than one reducer is specified. So /* is the correct file
>>> pattern and it is the calling site that should remove the /part-00000
>>> reference. The code in isConverged should loop through all the parts,
>>> returning whether they have all converged or not.
>>>
>>> I'll take a detailed look tomorrow.
>>>
>>>
>>> Grant Ingersoll wrote:
>>>>
>>>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>>>> 0.20 now resolves globs. I somewhat blindly applied "/*" where needed, but
>>>> I think it is likely worth revisiting here where a specific file is needed?
>>>>
>>>> -Grant
>>>>
>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>>
>>>>> This error is still bugging me. The exception:
>>>>>
>>>>> WARNING: java.io.FileNotFoundException: File
>>>>> output/clusters-0/part-00000/* does not exist.
>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>> does not exist.
>>>>>
>>>>> occurs first at:
>>>>>
>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>>
>>>>> which corresponds to:
>>>>>
>>>>>  private static boolean isConverged(String filePath, JobConf conf,
>>>>>      FileSystem fs) throws IOException {
>>>>>    Path outPart = new Path(filePath + "/*");
>>>>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  <-- THIS
>>>>>    ...
>>>>>  }
>>>>>
>>>>> where isConverged() is called in this fashion:
>>>>>
>>>>>  return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>>
>>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>>
>>>>>  String clustersOut = output + "/clusters-" + iteration;
>>>>>  converged = runIteration(input, clustersIn, clustersOut, measureClass,
>>>>>      delta, numReduceTasks, iteration);
>>>>>
>>>>> Consequently, assuming it's the first iteration and the output folder
>>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>>> believe the path should end in "part-00000" and the + "/*" should be
>>>>> removed... although someone, evidently, thought otherwise.
>>>>>
>>>>> Any feedback?
>>>>>
>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone wrote:
>>>>>>
>>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>>> revision 790979 before updating. Strangely, there were no changes in
>>>>>> the job and driver classes from that revision.
>>>>>> svn diff shows that the only classes that changed in the
>>>>>> org.apache.mahout.clustering.kmeans package were KMeansInfo.java
>>>>>> and RandomSeedGenerator.java
>>>>>>
>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman wrote:
>>>>>>>
>>>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>>>> other errors? What was the last revision you were running? It does look
>>>>>>> like something got horked, as it should be looking for
>>>>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>>>>> changed?
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> nfantone wrote:
>>>>>>>>
>>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>>>>> got the following exception:
>>>>>>>>
>>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>>> does not exist.
>>>>>>>>
>>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>>> directory, which doesn't exist; and, as far as I know, "part-xxxxx"
>>>>>>>> are default names for output files.
>>>>>>>>
>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone wrote:
>>>>>>>>
>>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>>
>>>>>>>>>> The logical format of input to KMeans is as it is in sequence
>>>>>>>>>> file format, but the Key is never used. To my knowledge, there is no
>>>>>>>>>> requirement to assign identifiers to the input points*.
>>>>>>>>>> Users are free to associate an arbitrary name field with each
>>>>>>>>>> vector - also label mappings may be assigned - but these are not
>>>>>>>>>> manipulated by KMeans or any of the other clustering applications.
>>>>>>>>>> The name field is now used as a vector identifier by the
>>>>>>>>>> KMeansClusterMapper - if it is non-null - in the output step only.
>>>>>>>>>
>>>>>>>>> The keys may not be used internally, but externally they can prove to
>>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>>> information as is quite neat as it allows me to, for instance,
>>>>>>>>> retrieve user information using data directly from a HDFS file's
>>>>>>>>> field.
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
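
For reference, the fix Jeff describes in the quoted thread (isConverged()
should loop over every part file and report convergence only if all clusters
have converged) can be sketched in plain Java. This is only an illustration
under stated assumptions, not the Mahout code from r793620: it reads local
text files rather than HDFS sequence files, and the class name
ConvergenceCheck, the method allConverged, and the "<clusterId>\t<true|false>"
line layout are all hypothetical names made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical stand-in for the convergence check discussed in the thread.
final class ConvergenceCheck {

  // Returns true only if every cluster record in every part-* file under
  // clustersDir is flagged as converged. Assumed (made-up) record layout:
  // one cluster per line, "<clusterId>\t<true|false>".
  static boolean allConverged(Path clustersDir) throws IOException {
    boolean sawCluster = false;
    // Glob over all part files instead of hard-coding part-00000, since
    // several reducers produce several part files.
    try (DirectoryStream<Path> parts =
             Files.newDirectoryStream(clustersDir, "part-*")) {
      for (Path part : parts) {
        for (String line : Files.readAllLines(part)) {
          if (line.isEmpty()) {
            continue;
          }
          sawCluster = true;
          String[] fields = line.split("\t");
          if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
            return false; // one unconverged cluster is enough
          }
        }
      }
    }
    // With no clusters at all there is nothing to call converged.
    return sawCluster;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("clusters-0");
    Files.write(dir.resolve("part-00000"), Arrays.asList("c1\ttrue", "c2\ttrue"));
    Files.write(dir.resolve("part-00001"), Arrays.asList("c3\tfalse"));
    System.out.println(allConverged(dir)); // false: c3 has not converged
  }
}
```

The point of the sketch is the shape of the loop: the caller passes the
clusters directory (the equivalent of output/clusters-0), and the glob is
applied inside the check, so no part-00000 suffix is needed at the call site.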