Mailing-List: contact dev-help@pig.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@pig.apache.org
Delivered-To: mailing list dev@pig.apache.org
Date: Wed, 2 Mar 2011 21:05:37 +0000 (UTC)
From: "Thejas M Nair (JIRA)"
To: pig-dev@hadoop.apache.org
Message-ID: <1198576162.8934.1299099937511.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1662850998.6409.1299022117621.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[
https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001640#comment-13001640 ]

Thejas M Nair commented on PIG-1875:
------------------------------------

This idea is likely to speed up queries where Pig ends up spilling to disk today. But it will have a larger memory footprint in cases where we would not otherwise have spilled to disk, if I assume that deserializing more than once is going to be very expensive, so that a deserialized copy is kept alongside the serialized one.

Maybe this can be turned on for a stream once we see a need to spill. If we do that, the first spill will not use this approach. This is hopefully easy to do, but I haven't checked.

For example, this approach is not going to be useful for the leftmost stream in a join; there it makes sense to keep only the deserialized version in memory. For the other streams, when we know we are likely to spill to disk, Pig can be more aggressive in destroying the deserialized copy. The bag holding the tuple can be in charge of destroying the deserialized copy.

> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>             Fix For: 0.10
>
>         Attachments: mrtuple.patch
>
>
> Currently Pig reads records off of the reduce iterator and immediately deserializes them into Java objects. This takes up much more memory than the serialized versions, so Pig spills sooner than it would if it stored them in serialized form. Also, if it does have to spill, it has to serialize them again, and then deserialize them again after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce iterator.

--
This message is automatically generated by JIRA.
- For more information on JIRA, see: http://www.atlassian.com/software/jira
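
To make the proposal in the comment concrete, here is a minimal sketch of a bag that keeps each tuple's serialized bytes and is in charge of destroying the deserialized copies. It is not taken from mrtuple.patch or from Pig's code: the class names `LazyTuple` and `SerializedBag`, the method `expectSpill`, and the comma-separated UTF-8 encoding are all hypothetical stand-ins chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tuple: always holds the serialized bytes, decodes them on demand. */
final class LazyTuple {
    private final byte[] serialized; // compact form, as read off the reduce iterator
    private String[] fields;         // deserialized copy; null until requested or after release

    LazyTuple(byte[] serialized) {
        this.serialized = serialized;
    }

    /** Deserialize on first access and cache the result for later calls. */
    String[] fields() {
        if (fields == null) {
            // Assumption: records are comma-separated UTF-8 text, not Pig's real format.
            fields = new String(serialized, StandardCharsets.UTF_8).split(",", -1);
        }
        return fields;
    }

    /** Drop the deserialized copy; the serialized bytes stay in memory. */
    void releaseDeserialized() {
        fields = null;
    }

    boolean isDeserialized() {
        return fields != null;
    }
}

/** Hypothetical bag that decides when deserialized copies are destroyed. */
final class SerializedBag {
    private final List<LazyTuple> tuples = new ArrayList<>();
    private boolean aggressive; // set once a spill looks likely for this stream

    void add(byte[] record) {
        tuples.add(new LazyTuple(record));
    }

    /** Called when a spill is anticipated: free every cached deserialized copy. */
    void expectSpill() {
        aggressive = true;
        for (LazyTuple t : tuples) {
            t.releaseDeserialized();
        }
    }

    /** Return the fields of tuple i; in aggressive mode the cache is not retained. */
    String[] get(int i) {
        LazyTuple t = tuples.get(i);
        String[] result = t.fields();
        if (aggressive) {
            t.releaseDeserialized();
        }
        return result;
    }

    /** Number of tuples currently holding a deserialized copy. */
    int cachedCount() {
        int n = 0;
        for (LazyTuple t : tuples) {
            if (t.isDeserialized()) {
                n++;
            }
        }
        return n;
    }
}
```

In this sketch, reading a tuple before `expectSpill()` leaves its deserialized copy cached, which trades memory for avoiding repeated deserialization. After `expectSpill()` the bag frees all cached copies and stops retaining new ones, so only the serialized bytes occupy memory. The case of a stream that keeps only the deserialized version is deliberately left out.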