mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Failure to run Clustering example
Date Mon, 11 May 2009 13:38:52 GMT
On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>>
>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>> Though the total number of unique terms in the index is 50,000 each
>> vector has less than 100 unique terms. (ie each document vector is a
>> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>> Hadoop has one node.  Values of T1 and T2 were 80 and 55 respectively,
>> as given in the sample program.
>
> Have you profiled it?  Would be good to see where the issue is coming from.
>

Apologies for the late reply.

I ran clustering on 100 documents with Hadoop's profile flag set to
true. The Canopy mapper took an hour and the reducer took 32 minutes to
generate these results. The Canopy Clustering job has yet to finish.
Here are the relevant outputs (heap allocation sites first, then CPU
samples).

Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
    2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697 java.lang.Integer
    3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
    4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
    5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 77.67% 77.67%  99614736    1  99614736     1 304245 byte[]
    2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840 java.lang.Integer
    3  5.58% 93.91%   7158048 447378 359948080 22496755 305451 java.lang.Integer
    4  3.07% 96.98%   3932176    1   3932176     1 304274 int[]
    5  1.02% 98.00%   1310736    1   1310736     1 304272 int[]


Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
    2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
    3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
    4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
    5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
    6  1.51% 34.39%     37528  260    186896  1229 300086 char[]


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum     bytes objs     bytes  objs trace name
    1 12.29% 12.29%    677088 42318 1811526016 113220376 306902 java.lang.Integer
    2 12.25% 24.53%    674816 42176 108428384 6776774 307108 java.lang.Integer
    3 11.52% 36.05%    634696  102   3574600 10233 300008 char[]
    4 10.64% 46.69%    586128 24422   1804296 75179 306879 java.util.HashMap$Entry
    5  7.09% 53.78%    390752 24422   4535616 283476 306878 java.lang.Double
    6  7.06% 60.84%    389248 24328   4519120 282445 306880 java.lang.Integer
    7  3.96% 64.80%    218224   74    359448  2939 303276 byte[]



Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out  (Mapper)
rank   self  accum   count trace method
   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode

Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum   count trace method
   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
rank   self  accum   count trace method
   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait


I also took a heap dump while the Mapper was running. 98% of the memory
was used by the byte arrays allocated/referenced in
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.
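
A rough cross-check of that observation, as a sketch only: assuming the
job ran with Hadoop's default io.sort.mb=100 and
io.sort.record.percent=0.05 (neither setting is stated in this message),
and assuming MapOutputBuffer splits the sort memory into one byte[] and
two int[] arrays using 16 bytes of accounting per record, the arithmetic
below gives the array sizes. The class and method names are illustrative,
not Hadoop API.

```java
public class SortBufferSizes {
    // Assumed accounting constants for the map-side sort buffer.
    static final int ACCTSIZE = 3;                   // ints per record in the index array
    static final int RECSIZE = (ACCTSIZE + 1) * 4;   // accounting bytes per record

    // Bytes of the sort memory reserved for record accounting.
    static int recordBytes(int sortMb, float recordPercent) {
        int maxMemUsage = sortMb << 20;              // megabytes to bytes
        int capacity = (int) (maxMemUsage * recordPercent);
        return capacity - capacity % RECSIZE;        // round down to whole records
    }

    // Payload bytes of the serialization buffer (the large byte[]).
    static int kvBufferBytes(int sortMb, float recordPercent) {
        return (sortMb << 20) - recordBytes(sortMb, recordPercent);
    }

    // Payload bytes of the offsets array (one int per record).
    static int kvOffsetsBytes(int sortMb, float recordPercent) {
        return recordBytes(sortMb, recordPercent) / RECSIZE * 4;
    }

    // Payload bytes of the index array (ACCTSIZE ints per record).
    static int kvIndicesBytes(int sortMb, float recordPercent) {
        return recordBytes(sortMb, recordPercent) / RECSIZE * ACCTSIZE * 4;
    }

    public static void main(String[] args) {
        System.out.println("byte[] payload:        " + kvBufferBytes(100, 0.05f));
        System.out.println("int[] offsets payload: " + kvOffsetsBytes(100, 0.05f));
        System.out.println("int[] index payload:   " + kvIndicesBytes(100, 0.05f));
    }
}
```

Under those assumptions the payloads come to 99614720, 1310720 and
3932160 bytes, each 16 bytes below the byte[] and int[] sizes in the
mapper profiles (99614736, 1310736, 3932176). The 16-byte gap would be
consistent with per-array header overhead. If that reading is right, the
large byte[] is a fixed buffer sized by configuration rather than by the
input data.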

The document vectors for the input set (of 100 docs) are available here:
http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3

I create canopies with the following command:

$bin/hadoop jar ../mahout-examples-0.1.job
org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55

The t1 and t2 values are the ones given for the synthetic data
example. Should the values of t1 and t2 affect the runtime
dramatically?
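
For context on that question, here is a generic, single-machine sketch
of a canopy pass. It is not Mahout's CanopyClusteringJob code: the class
name CanopySketch, the dense double[] points and the distance-call
counter are all illustrative assumptions. It only shows where the two
thresholds enter. A point that is not within T2 of any existing canopy
center starts a new canopy, and every later point is compared against
every center, so the number of distance evaluations grows with the
number of canopies.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
    /** Distance evaluations performed by the most recent call to cluster(). */
    static long distanceCalls = 0;

    static double euclidean(double[] a, double[] b) {
        distanceCalls++;
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns canopies as lists of point indices. The first index in each
     * list is the canopy's center. Expects t1 > t2.
     */
    static List<List<Integer>> cluster(double[][] points, double t1, double t2) {
        distanceCalls = 0;
        List<List<Integer>> canopies = new ArrayList<List<Integer>>();
        for (int p = 0; p < points.length; p++) {
            boolean stronglyBound = false;
            for (List<Integer> canopy : canopies) {
                double dist = euclidean(points[p], points[canopy.get(0)]);
                if (dist < t1) {
                    canopy.add(p);          // loosely covered: joins this canopy
                }
                if (dist < t2) {
                    stronglyBound = true;   // close enough: will not seed a new canopy
                }
            }
            if (!stronglyBound) {
                List<Integer> fresh = new ArrayList<Integer>();
                fresh.add(p);               // the point becomes a new center
                canopies.add(fresh);
            }
        }
        return canopies;
    }

    /** n points on a line, one unit apart (toy data, not the document vectors). */
    static double[][] linePoints(int n) {
        double[][] points = new double[n][];
        for (int i = 0; i < n; i++) {
            points[i] = new double[] {i, 0.0};
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] points = linePoints(200);
        for (double t2 : new double[] {0.5, 10.0, 1000.0}) {
            List<List<Integer>> canopies = cluster(points, 1.5 * t2, t2);
            System.out.println("t2=" + t2 + " canopies=" + canopies.size()
                    + " distanceCalls=" + distanceCalls);
        }
    }
}
```

With these evenly spaced toy points, T2 = 0.5 makes every point its own
canopy (200 canopies, 19900 distance evaluations), while T2 = 1000
yields a single canopy and 199 evaluations. Whether 80 and 55 sit near
either extreme for the document vectors above depends on the actual
distances between them, which this sketch does not measure.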

Thanks,

--shashi
