Date: Tue, 19 Mar 2013 04:11:15 +0000 (UTC)
From: "Johnny Zhang (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Subject: [jira] [Updated] (PIG-3170) Pig keeps static references to Hadoop's Context after end of task

     [ https://issues.apache.org/jira/browse/PIG-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3170:
------------------------------

    Attachment: PIG-3170.patch.txt

The old patch no longer applies cleanly on trunk. I just uploaded a new patch and will try the unit tests on top of it.

> Pig keeps static references to Hadoop's Context after end of task
> -----------------------------------------------------------------
>
>                 Key: PIG-3170
>                 URL: https://issues.apache.org/jira/browse/PIG-3170
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>   Affects Versions: 0.10.0
>            Reporter: Clément Stenac
>            Priority: Minor
>         Attachments: PIG-3170.patch.txt, pig-staticreferences-to-context.diff
>
>
> Through the PigStatusReporter, and the ProgressableReporter, when a Pig MR task is done, static references are kept to Hadoop's Context object.
> Additionally, the PigCombiner also keeps a static reference, apparently without using it.
> When the JVM is reused between MR tasks, it can cause large memory overconsumption, with a peak during the creation of the next task, because while MR is creating the next task (in MapTask. for example), we have both contexts (with their associated buffers) allocated at once.
> This problem is especially important when using a Combiner, because the ReduceContext of a Combiner contains references to large sort buffers.
> The specifics of our case were:
> * 20 GB input data, divided in 85 map tasks
> * Very simple Pig script: LOAD A, FILTER A, GROUP A, FOREACH group generate MAX(field), STORE
> * MapR distribution, which automatically computes Xmx for mappers at 800MB
> * At the end of the first task, the ReduceContext contains more than 400MB of byte[]
> * Systematic OOM in MapTask. on subsequent VM reuse
> * At least -Xmx1200m was required to get the job to complete
> * With attached patch, -Xmx600m is enough
> While a workaround by increasing Xmx is possible, I think the large overconsumption and the complexity of debugging the issue (because the OOM actually happens at the very beginning of the task, before the first byte of data has been processed) warrants fixing it.
> The attached patch makes sure that PigStatusReporter and ProgressableReporter drop their reference to the Context in the cleanup phases of the task.
> No new test is included because I don't really think it's possible to write a unit test, the issue being not "binary"
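
For readers unfamiliar with the pattern the issue describes, the following is a minimal sketch, not Pig's actual code. It assumes a hypothetical singleton reporter (`StaticReporter`) that holds a reference to a hypothetical stand-in for Hadoop's Context (`FakeContext`), and shows how clearing that reference in a cleanup step lets the previous task's buffers become collectable when the JVM is reused. All class and method names here are invented for illustration and do not come from the patch.

```java
// Hypothetical stand-in for Hadoop's task Context; NOT the real Hadoop API.
final class FakeContext {
    final byte[] sortBuffer; // models the large buffers reachable from a real context

    FakeContext(int bufferBytes) {
        this.sortBuffer = new byte[bufferBytes];
    }
}

// Hypothetical singleton reporter, loosely modelled on the pattern described in the issue.
final class StaticReporter {
    private static final StaticReporter INSTANCE = new StaticReporter();
    private FakeContext context; // reachable from a static field, so it outlives the task

    private StaticReporter() {}

    static StaticReporter getInstance() {
        return INSTANCE;
    }

    void setContext(FakeContext context) {
        this.context = context;
    }

    boolean holdsContext() {
        return context != null;
    }

    // Intended to be called from the task's cleanup phase: dropping the
    // reference makes the context and its buffers eligible for garbage collection.
    void clearContext() {
        this.context = null;
    }
}

final class ContextLeakSketch {
    // Runs one simulated task and reports whether the singleton still
    // holds the task's context after the task has finished.
    static boolean runTask(boolean cleanup) {
        StaticReporter reporter = StaticReporter.getInstance();
        reporter.setContext(new FakeContext(1024));
        try {
            // Task work would happen here.
        } finally {
            if (cleanup) {
                reporter.clearContext();
            }
        }
        return reporter.holdsContext();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup, context retained: " + runTask(false));
        System.out.println("with cleanup, context retained: " + runTask(true));
    }
}
```

In the sketch, the run without cleanup leaves the old context reachable through the static singleton, which is the situation in which a reused JVM would hold two contexts at once while the next task is being created. The run with cleanup releases it before the next task starts.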