Date: Thu, 08 Dec 2011 13:40:44 +0530
From: Devaraj K
Subject: RE: OOM Error Map output copy.
To: common-user@hadoop.apache.org
Reply-to: devaraj.k@huawei.com
In-reply-to: <349A4555-E892-4A67-946D-D07C6732E960@cs.washington.edu>
Message-id: <01076FA1E1ED423C94B90F46AB086928@china.huawei.com>

Hi Niranjan,

Everything looks OK as per the info you have given. Can you check in the job.xml file whether these child opts are reflected, or whether anything else is overwriting this config?

3. mapred.child.java.opts --> -Xms512M -Xmx1536M -XX:+UseSerialGC

Also, can you tell me which version of Hadoop you are using?

Devaraj K

-----Original Message-----
From: Niranjan Balasubramanian [mailto:niranjan@cs.washington.edu]
Sent: Thursday, December 08, 2011 12:21 AM
To: common-user@hadoop.apache.org
Subject: OOM Error Map output copy.

All

I am encountering the following out-of-memory error during the reduce phase of a large job.

Map output copy failure : java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)

I tried increasing the memory available using mapred.child.java.opts, but that only helps a little. The reduce task eventually fails again.

Here are some relevant job configuration details:

1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers filter out a small percentage of the input (less than 1%).
2. I am currently using 12 reducers, and I can't increase this count by much because reduce slots must stay available for other users.
3. mapred.child.java.opts --> -Xms512M -Xmx1536M -XX:+UseSerialGC
4. mapred.job.shuffle.input.buffer.percent --> 0.70
5. mapred.job.shuffle.merge.percent --> 0.66
6. mapred.inmem.merge.threshold --> 1000
7. I have nearly 5000 mappers, which are supposed to produce LZO-compressed outputs. The logs seem to indicate that the map outputs range from 0.3 GB to 0.8 GB.

Does anything here seem amiss? I'd appreciate any input on what settings to try. I can try different reduced values for the input buffer percent and the merge percent. Given that the job runs for about 7-8 hours before crashing, I would like to make some informed choices if possible.

Thanks.
~ Niranjan.
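
For reference, the settings in items 3 to 5 can be turned into rough byte budgets before committing to another 7-8 hour run. The sketch below is only back-of-the-envelope arithmetic, not Hadoop's own code: the class and method names are hypothetical, it assumes the shuffle buffer is sized as a fraction of the -Xmx heap, and the 0.25 cap on a single in-memory segment is an assumption about 0.20/1.x-era ReduceTask that should be checked against the version actually in use.

```java
// Hypothetical helper (not part of Hadoop): rough reducer shuffle budgets
// derived from the settings quoted in the message above.
public class ShuffleBudget {

    /** Heap bytes set aside for holding fetched map outputs in memory. */
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    /** Buffer fill level at which an in-memory merge would be started. */
    static long mergeTriggerBytes(long bufferBytes, double mergePercent) {
        return (long) (bufferBytes * mergePercent);
    }

    /** Largest single map output kept in memory (fraction is an assumption). */
    static long maxSingleSegmentBytes(long bufferBytes, double segmentFraction) {
        return (long) (bufferBytes * segmentFraction);
    }

    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        long heap = 1536L * 1024 * 1024;  // -Xmx1536M
        double inputBufferPercent = 0.70; // mapred.job.shuffle.input.buffer.percent
        double mergePercent = 0.66;       // mapred.job.shuffle.merge.percent
        double segmentFraction = 0.25;    // ASSUMPTION: version-dependent constant

        long buffer = shuffleBufferBytes(heap, inputBufferPercent);
        long trigger = mergeTriggerBytes(buffer, mergePercent);
        long segment = maxSingleSegmentBytes(buffer, segmentFraction);

        System.out.println("shuffle buffer (MB):        " + toMb(buffer));
        System.out.println("merge trigger (MB):         " + toMb(trigger));
        System.out.println("max in-memory segment (MB): " + toMb(segment));
        System.out.println("heap outside buffer (MB):   " + toMb(heap - buffer));
    }
}
```

Re-running it with a lower input buffer percent shows how much heap is left outside the shuffle buffer for everything else the reduce task needs, which is one way to compare candidate values for items 4 and 5.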