From: Shashikant Kore
Date: Mon, 11 May 2009 19:08:52 +0530
Message-ID: <17469b150905110638i1e4a79b0hc63ff310fab7a0e3@mail.gmail.com>
Subject: Re: Failure to run Clustering example
To: mahout-user@lucene.apache.org

On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll wrote:
>
>>
>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>> Though the total number of unique terms in the index is 50,000 each
>> vector has less than 100 unique terms. (ie each document vector is a
>> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>> Hadoop has one node. Values of T1 and T2 were 80 and 55 respectively,
>> as given in the sample program.
>
> Have you profiled it? Would be good to see where the issue is coming from.
>

Apologies for the late reply.

I ran clustering on 100 documents with the Hadoop profile flag set to true. The Canopy mapper took an hour and the reducer took 32 minutes to generate these results. The Canopy Clustering job is yet to finish. Here are the relevant outputs.

Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out (Mapper)

rank   self  accum     bytes    objs       bytes       objs  trace name
   1 84.51% 84.51%  99614736       1    99614736          1 304249 byte[]
   2  5.53% 90.05%   6522848  407678  3336600480  208537530 304697 java.lang.Integer
   3  3.34% 93.38%   3932176       1     3932176          1 304252 int[]
   4  3.03% 96.41%   3567216  222951   690373248   43148328 305480 java.lang.Integer
   5  1.11% 97.52%   1310736       1     1310736          1 304250 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)

rank   self  accum     bytes    objs       bytes       objs  trace name
   1 77.67% 77.67%  99614736       1    99614736          1 304245 byte[]
   2 10.66% 88.33%  13676528  854783  2037966768  127372923 304840 java.lang.Integer
   3  5.58% 93.91%   7158048  447378   359948080   22496755 305451 java.lang.Integer
   4  3.07% 96.98%   3932176       1     3932176          1 304274 int[]
   5  1.02% 98.00%   1310736       1     1310736          1 304272 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)

rank   self  accum     bytes    objs       bytes       objs  trace name
   1 10.16% 10.16%    253112    1594     1140784       6850 300008 char[]
   2  9.07% 19.23%    225936      64      946288        266 300184 byte[]
   3  9.06% 28.29%    225816      64      895128        232 300781 byte[]
   4  2.63% 30.92%     65552       1       65552          1 302380 byte[]
   5  1.97% 32.89%     49048     130      252256        700 300056 byte[]
   6  1.51% 34.39%     37528     260      186896       1229 300086 char[]

Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)

rank   self  accum     bytes    objs       bytes       objs  trace name
   1 12.29% 12.29%    677088   42318  1811526016  113220376 306902 java.lang.Integer
   2 12.25% 24.53%    674816   42176   108428384    6776774 307108 java.lang.Integer
   3 11.52% 36.05%    634696     102     3574600      10233 300008 char[]
   4 10.64% 46.69%    586128   24422     1804296      75179 306879 java.util.HashMap$Entry
   5  7.09% 53.78%    390752   24422     4535616     283476 306878 java.lang.Double
   6  7.06% 60.84%    389248   24328     4519120     282445 306880 java.lang.Integer
   7  3.96% 64.80%    218224      74      359448       2939 303276 byte[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)

rank   self  accum   count  trace method
   1 96.85% 96.85%  347772 304838 java.lang.Object.
   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode

Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)

rank   self  accum   count  trace method
   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1

Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)

rank   self  accum   count  trace method
   1 93.77% 93.77%  236947 304890 java.lang.Object.
   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait

I also took a heap dump while the mapper was running. 98% of the memory was used by the byte arrays allocated/referenced in org.apache.hadoop.mapred.MapTask$MapOutputBuffer.

The document vectors for the input set (of 100 docs) are available here:
http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3

I create canopies with the following command.

$bin/hadoop jar ../mahout-examples-0.1.job org.apache.mahout.clustering.canopy.CanopyClusteringJob test100 output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55

The t1 and t2 values are the ones given for the synthetic data example. Should the values of t1 and t2 affect the runtime dramatically?

Thanks,
--shashi
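
To make the t1/t2 question concrete, here is a minimal sketch of a single-pass canopy builder. It is not Mahout's CanopyMapper code: the names CanopySketch, buildCanopies and distanceCalls are hypothetical, and it assumes a variant in which each point is compared with every existing canopy center and starts a new canopy only if it is not within t2 of any of them. Under that assumption the number of distance evaluations grows with (points x canopies), so thresholds that are small relative to the typical distance between documents may push the work toward quadratic.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative single-pass canopy sketch; an assumption, not Mahout's implementation. */
final class CanopySketch {
    /** Distance evaluations performed by the most recent buildCanopies call. */
    static long distanceCalls = 0;

    /** Plain Euclidean distance between two dense vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        distanceCalls++;
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Builds canopies in one pass. Element 0 of each canopy is its center.
     * A point within t1 of a center joins that canopy; a point within t2 of
     * any center is "strongly bound" and does not start a new canopy.
     */
    static List<List<double[]>> buildCanopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("expected t1 > t2");
        }
        distanceCalls = 0;
        List<List<double[]>> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (List<double[]> canopy : canopies) {
                double d = euclidean(canopy.get(0), p);
                if (d < t1) {
                    canopy.add(p);
                }
                if (d < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                List<double[]> fresh = new ArrayList<>();
                fresh.add(p); // the point becomes the center of a new canopy
                canopies.add(fresh);
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        // Ten one-dimensional points spaced 10 apart: 0, 10, ..., 90.
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            points.add(new double[] {10.0 * i});
        }
        for (double[] t : new double[][] {{80, 55}, {8, 5}}) {
            int n = buildCanopies(points, t[0], t[1]).size();
            System.out.println("t1=" + t[0] + " t2=" + t[1]
                    + " canopies=" + n + " distanceCalls=" + distanceCalls);
        }
    }
}
```

On this toy input, thresholds of 80 and 55 give 2 canopies after 12 distance evaluations, while 8 and 5 give 10 canopies after 45 evaluations, which is n(n-1)/2 for n = 10. In this sketch only t2 controls how many canopies are created; t1 affects membership but not the number of comparisons.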