From: nfantone <nfantone@gmail.com>
To: mahout-user@lucene.apache.org
Date: Mon, 27 Jul 2009 01:00:23 -0300
Subject: Re: Clustering from DB

Thanks, Grant. I just updated and noticed the change.
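(The change in question, for anyone skimming: the size check in
SquaredEuclideanDistanceMeasure quoted further down the thread. Presumably
the fixed method now reads something like the sketch below; I haven't
diffed the actual commit, so take the exact body with a grain of salt:

public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
  // Compare centroid against v, not centroid against itself.
  if (centroid.size() != v.size()) {
    throw new CardinalityException();
  }
  // |c - v|^2 = |c|^2 - 2*(c . v) + |v|^2, reusing the precomputed |c|^2.
  return centroidLengthSquare - 2 * centroid.dot(v) + v.dot(v);
}
)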
As a side note: do you think someone could run some real tests on kMeans
in particular, other than the ones already in the project? I bet there
are other naive (or not so naive) problems like that one. After much
coding, reading and experimenting with clustering in Mahout over the last
few weeks, I am inclined to say something may not be fully working in
kMeans as of now. Or perhaps it just needs some refactoring/performance
tweaks.

Jeff has claimed to run the job over gigabytes of data, using a rather
small cluster, in minutes. Has anyone tried to accomplish this recently
(since the Hadoop upgrade to 0.20)? Just use ClusteringUtils to write a
file of some (arguably not so) significant number of random Vectors (say,
800,000+) and let that be the input of a KMeansMRJob (testKMeansMRJob()
could very well serve this purpose with little change). You'll end up
with a file of about ~85MB to ~100MB, which can easily fit into memory on
any modern computer. Now, run the whole thing (I've tried both locally
and on a three-node cluster setup, which, frankly, seemed like a bit too
much computing power for such a small number of items in the dataset).
It'll take forever to complete.

These simple methods could be used to generate any given number of random
SparseVectors for testing's sake, if anyone is interested:

private static Random rnd = new Random();

private static final int CARDINALITY = 1200;
private static final int MAX_NON_ZEROS = 200;
private static final int MAX_VECTORS = 850000;

// Builds a SparseVector with a random id and a random number
// (1..MAX_NON_ZEROS - 1) of non-zero entries at random indices.
private static Vector getRandomVector() {
  Integer id = rnd.nextInt(Integer.MAX_VALUE);
  Vector v = new SparseVector(id.toString(), CARDINALITY);
  int nonZeros = 0;
  // Re-draw until we get at least one non-zero entry.
  while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
  for (int i = 0; i < nonZeros; i++) {
    v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
  }
  return v;
}

private static List<Vector> getVectors() {
  List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
  for (int i = 0; i < MAX_VECTORS; i++) {
    vectors.add(getRandomVector());
  }
  return vectors;
}
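And in case it helps anyone reproduce the numbers, this is more or less
how I dump those vectors into a SequenceFile for the job's input. A
sketch, not verbatim what I run: I'm assuming the 0.2 layout (vectors in
org.apache.mahout.matrix), <Text, SparseVector> records keyed by the
vector's name, and a local output path; adjust to taste:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

public class RandomVectorWriter {

  // Writes the given vectors as <Text, SparseVector> records, which is
  // the logical format of the dataset discussed below in this thread.
  public static void writeVectors(List<Vector> vectors, String output)
      throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path(output), Text.class, SparseVector.class);
    try {
      for (Vector v : vectors) {
        // Key each record by the vector's name (set to the random id above).
        writer.append(new Text(v.getName()), (SparseVector) v);
      }
    } finally {
      writer.close();
    }
  }
}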
On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll wrote:
> Fixed on MAHOUT-152
>
> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>
>> That does indeed look like a problem.  I'll fix.
>>
>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>
>>> While (still) experiencing performance issues and inspecting kMeans
>>> code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>>>
>>> public double distance(double centroidLengthSquare, Vector centroid,
>>> Vector v) {
>>>   if (centroid.size() != centroid.size()) {
>>>     throw new CardinalityException();
>>>   }
>>>   ...
>>> }
>>>
>>> I bet someone meant to compare centroid and v sizes and didn't notice.
>>>
>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone wrote:
>>>>
>>>> Well, as it turned out, it didn't have anything to do with my
>>>> performance issue, but I found out that writing a Cluster (with a
>>>> single vector as its center) to a file and then reading it requires
>>>> the center to be added as a point; otherwise, you won't be able to
>>>> retrieve it properly. Therefore, one should do:
>>>>
>>>> // Writing
>>>> String id = "someID";
>>>> Vector v = new SparseVector();
>>>> Cluster c = new Cluster(v);
>>>> c.addPoint(v);
>>>> seqWriter.append(new Text(id), c);
>>>>
>>>> // Reading
>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>> while (seqReader.next(key, value)) {
>>>>   ...
>>>>   Vector centroid = value.getCenter();
>>>>   ...
>>>> }
>>>>
>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>
>>>> Sorry for bringing different subjects to the same thread.
>>>>
>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone wrote:
>>>>>
>>>>> I've been using RandomSeedGenerator to generate initial clusters
>>>>> for kMeans, and while checking its code I stumbled upon this:
>>>>>
>>>>>   while (reader.next(key, value)) {
>>>>>     Cluster newCluster = new Cluster(value);
>>>>>     newCluster.addPoint(value);
>>>>>     ...
>>>>>   }
>>>>>
>>>>> I can see it adds the vector to the newly created cluster, even
>>>>> though it is setting it as its center in the constructor. Wasn't
>>>>> this corrected in a past revision? I thought this was not necessary
>>>>> anymore. I'll look into it a little bit more and see if it has
>>>>> something to do with the poor performance on my dataset.
>>>>>
>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone wrote:
>>>>>>>>>
>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>
>>>>>>>> I'll try that.
>>>>>>
>>>>>> There was no significant change after modifying the convergence
>>>>>> value. At least, none was observed during the first three
>>>>>> iterations, which lasted more or less the same amount of time as
>>>>>> before.
>>>>>>
>>>>>>>>> Is there any chance your data is publicly shareable?  Come to
>>>>>>>>> think of it, with the vector representations, as long as you
>>>>>>>>> don't publish the key (which terms map to which index), I would
>>>>>>>>> think almost all data is publicly shareable.
>>>>>>>>
>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>>> shareable? As in user permissions to access/read/write the data?
>>>>>>>
>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>> assuming you can.  Then others could presumably try it out.
>>>>>>
>>>>>> My bad. Of course it is:
>>>>>>
>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>
>>>>>> That's the ~62MB SequenceFile sample I've been using, in
>>>>>> <Text, SparseVector> logical format.
>>>>>>
>>>>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>>>>> cluster. How many iterations are running?
>>>>>>
>>>>>> I'm running the whole thing with a 20-iteration cap. Every
>>>>>> iteration (EXCEPT the first one, which, oddly, lasted just two
>>>>>> minutes) took around 3 hours to complete:
>>>>>>
>>>>>> Hadoop job_200907221734_0001
>>>>>> Finished in: 1mins, 42sec
>>>>>>
>>>>>> Hadoop job_200907221734_0004
>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>
>>>>>> Hadoop job_200907221734_0005
>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>
>>>>>>> How did you generate your initial clusters?
>>>>>>
>>>>>> I generate the initial clusters via the RandomSeedGenerator,
>>>>>> setting a 'k' value of 200.
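>>>>>> (Programmatically, that amounts to something like the line below;
>>>>>> I'm going from memory on the 0.2 signature of
>>>>>> RandomSeedGenerator.buildRandom, so double-check it against the
>>>>>> source:
>>>>>>
>>>>>> // Sample k = 200 random vectors from the input as initial clusters.
>>>>>> RandomSeedGenerator.buildRandom("input/user.data", "init", 200);
>>>>>> )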
>>>>>> This is what I did to initiate the process for the first time:
>>>>>>
>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>> input/user.data
>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>> init/user.data
>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>>>>>> init -o output -r 32 -d 0.01 -k 200
>>>>>>
>>>>>>> Where are the iteration jobs spending most of their time (map vs.
>>>>>>> reduce)?
>>>>>>
>>>>>> I'm tempted to say map here, but the time they spend is actually
>>>>>> comparable. Reduce attempts are taking an hour and a half to finish
>>>>>> (on average), and so are map attempts. Here are some representative
>>>>>> examples from the web UI:
>>>>>>
>>>>>> reduce
>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>
>>>>>> map
>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>
>>>>>> Perhaps there's something wrong with the way I create the
>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>