From: hadoopman <hadoopman@gmail.com>
Date: Wed, 22 Jun 2011 17:40:48 -0600
To: common-user@hadoop.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

I've run into similar problems in my Hive jobs and will look at the 'mapred.child.ulimit' option.

One thing we've found is that when loading data with INSERT OVERWRITE into our Hive tables we've needed to include a 'CLUSTER BY' or 'DISTRIBUTE BY' clause. Generally that's fixed our memory issues during the reduce phase, though not 100% of the time (close, but not always).
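For anyone curious, here's a rough sketch of what one of those loads looks like on our side (table and column names made up for this example):

    -- hypothetical table/column names; DISTRIBUTE BY routes rows to reducers by key
    INSERT OVERWRITE TABLE events_clean
    SELECT user_id, event_type, event_time
    FROM events_raw
    DISTRIBUTE BY user_id;

As I understand it, DISTRIBUTE BY controls which reducer each row is sent to, so the data gets spread out by key instead of piling up in a few tasks; CLUSTER BY is the same thing plus a sort within each reducer.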
I understand the basics of what those options do, but I'm unclear as to *why* they are necessary (coming from an Oracle and Postgres DBA background). I'm guessing it has something to do with the underlying code.

On 06/18/2011 12:28 PM, Mapred Learn wrote:
> Did you try playing with mapred.child.ulimit along with java.opts?
>
> Sent from my iPhone
>
> On Jun 18, 2011, at 9:55 AM, Ken Williams wrote:
>
>> Hi All,
>>
>> I'm having a problem running a job on Hadoop. Using Mahout, I've been able to run several Bayesian classifiers and train and test them successfully on increasingly large datasets. Now I'm working on a dataset of 100,000 documents (about 100MB). I've trained the classifier on 80,000 docs and am using the remaining 20,000 as the test set. I've been able to train the classifier, but when I try to 'testclassifier' all the map tasks fail with a 'Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded' exception before the job itself is 'Killed'. I have a small cluster of 3 machines but plenty of memory and CPU power (3 x 16GB, 2.5GHz quad-core machines).
>>
>> I've tried setting the 'mapred.child.java.opts' flags up to 3GB in size (-Xms3G -Xmx3G) but still get the same error. I've also tried setting HADOOP_HEAPSIZE to values like 2000, 2500 and 3000, but this made no difference. When the program is running I can use 'top' to see that although the CPUs are busy, memory usage rarely goes above 12GB and absolutely no swapping is taking place. (See program console output: http://pastebin.com/0m2Uduxa, job config file: http://pastebin.com/4GEFSnUM.)
>>
>> I found a similar problem with a 'GC overhead limit exceeded' where the program was spending so much time garbage-collecting (more than 90% of its time!) that it was unable to progress and so threw the 'GC overhead limit exceeded' exception. If I set -XX:-UseGCOverheadLimit in the 'mapred.child.java.opts' property to avoid this exception, then I see the same behaviour as before, only a slightly different exception is thrown:
>>
>>   Caused by: java.lang.OutOfMemoryError: Java heap space
>>     at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39)
>>
>> So I'm guessing that maybe my program is spending too much time garbage-collecting for it to progress? But how do I fix this? There's no further info in the log files other than the exceptions being thrown. I tried reducing the 'dfs.block.size' parameter to reduce the amount of data going into each 'map' process (and therefore reduce its memory requirements), but this made no difference. I tried various settings for JVM reuse (mapred.job.reuse.jvm.num.tasks), using values for no re-use (0), limited re-use (10), and unlimited re-use (-1), but again no difference. I think the problem is in the job configuration parameters, but how do I find it? I'm using Hadoop 0.20.2 and the latest Mahout snapshot version. All machines are running 64-bit Ubuntu and Java 6. Any help would be very much appreciated,
>>
>> Ken Williams
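P.S. For my own reference (and anyone else following along), the knobs being discussed would go into mapred-site.xml or the per-job configuration roughly like this. The numbers are made up, not recommendations -- they'd need tuning per cluster:

    <!-- illustrative values only -->
    <property>
      <name>mapred.child.java.opts</name>
      <!-- heap (and any extra JVM flags) for each map/reduce task JVM -->
      <value>-Xmx2048m</value>
    </property>
    <property>
      <name>mapred.child.ulimit</name>
      <!-- virtual memory limit for the task process, in KB; keep it well
           above the heap size or the task gets killed for the wrong reason -->
      <value>4194304</value>
    </property>
    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <!-- 1 = fresh JVM per task (the default), -1 = reuse one JVM for all of a job's tasks -->
      <value>1</value>
    </property>

The same properties can usually be passed per job with -D on the command line (for drivers that go through ToolRunner/GenericOptionsParser), which is handier for experimenting than editing the cluster config.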