Return-Path:
Delivered-To: apmail-pig-dev-archive@www.apache.org
Received: (qmail 1479 invoked from network); 2 Mar 2011 00:06:28 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 2 Mar 2011 00:06:28 -0000
Received: (qmail 87243 invoked by uid 500); 2 Mar 2011 00:06:28 -0000
Delivered-To: apmail-pig-dev-archive@pig.apache.org
Received: (qmail 87194 invoked by uid 500); 2 Mar 2011 00:06:27 -0000
Mailing-List: contact dev-help@pig.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@pig.apache.org
Delivered-To: mailing list dev@pig.apache.org
Received: (qmail 87186 invoked by uid 500); 2 Mar 2011 00:06:27 -0000
Delivered-To: apmail-hadoop-pig-dev@hadoop.apache.org
Received: (qmail 87183 invoked by uid 99); 2 Mar 2011 00:06:27 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 02 Mar 2011 00:06:27 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 02 Mar 2011 00:06:27 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116])
    by hel.zones.apache.org (Postfix) with ESMTP id 0251B4AF1D
    for ; Wed, 2 Mar 2011 00:05:37 +0000 (UTC)
Date: Wed, 2 Mar 2011 00:05:37 +0000 (UTC)
From: "Alan Gates (JIRA)"
To: pig-dev@hadoop.apache.org
Message-ID: <1795461796.6501.1299024337005.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1662850998.6409.1299022117621.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Updated: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

[ 
https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1875:
----------------------------

    Attachment: mrtuple.patch

Here's a first pass at what MToRTuple might look like. I've done some basic testing to ensure this works, but nothing comprehensive.

In test runs where I serialized 100k tuples, wrote them to disk, and read them back, I got the following results:

DefaultTuple:
time to write to disk: 81.93 sec
size on disk: 98M
time to read from disk: 12.62 sec
size in memory (after read): 238M

MToRTuple:
time to write to disk: 10.49 sec
size on disk: 58M
time to read from disk: 1.10 sec
size in memory (after read): 57M

So roughly 1/4 the memory consumption and a ~10x speedup on disk reads and writes (about 8x on writes, about 11x on reads).

> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>         Attachments: mrtuple.patch
>
>
> Currently Pig reads records off of the reduce iterator and immediately deserializes them into Java objects. This takes up much more memory than the serialized versions, so Pig spills sooner than it would if it stored them in serialized form. Also, if it does have to spill, it has to serialize them again, and then deserialize them again after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce iterator.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
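
The idea in the issue description (keep each record as bytes, decode only on access, and spill by copying the bytes) can be sketched roughly as below. This is a minimal illustration only, not the contents of mrtuple.patch and not Pig's Tuple interface: the class names SerializedTupleSketch and LazyTuple, the string-only fields, and the length-prefixed encoding are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and wire format are assumptions, not Pig code.
public class SerializedTupleSketch {

    /** Holds a record in serialized form and decodes its fields on first access. */
    static final class LazyTuple {
        private final byte[] raw;      // serialized record, kept as-is in memory
        private List<String> fields;   // decoded view, null until get() is called

        LazyTuple(byte[] raw) {
            this.raw = raw;
        }

        /** Builds a tuple using an assumed encoding: field count, then each field via writeUTF. */
        static LazyTuple of(String... values) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(values.length);
            for (String v : values) {
                out.writeUTF(v);
            }
            out.flush();
            return new LazyTuple(buf.toByteArray());
        }

        /** Decodes all fields on the first call, then serves from the decoded list. */
        String get(int i) throws IOException {
            if (fields == null) {
                DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
                int n = in.readInt();
                List<String> decoded = new ArrayList<>(n);
                for (int k = 0; k < n; k++) {
                    decoded.add(in.readUTF());
                }
                fields = decoded;
            }
            return fields.get(i);
        }

        /** Spilling writes a length prefix and the raw bytes; nothing is re-serialized. */
        void spill(DataOutputStream out) throws IOException {
            out.writeInt(raw.length);
            out.write(raw);
        }

        /** Reading a spilled record copies the bytes back; nothing is deserialized yet. */
        static LazyTuple readSpilled(DataInputStream in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new LazyTuple(bytes);
        }

        boolean isDecoded() {
            return fields != null;
        }
    }

    public static void main(String[] args) throws IOException {
        LazyTuple t = LazyTuple.of("alice", "42");

        // Simulate a spill to disk and a read back, using an in-memory buffer.
        ByteArrayOutputStream disk = new ByteArrayOutputStream();
        t.spill(new DataOutputStream(disk));
        LazyTuple back = LazyTuple.readSpilled(
                new DataInputStream(new ByteArrayInputStream(disk.toByteArray())));

        System.out.println(back.isDecoded()); // false: the round trip decoded nothing
        System.out.println(back.get(1));      // 42
    }
}
```

Note that this sketch caches the decoded fields next to the raw bytes, so a tuple that has been accessed occupies more memory than one that has not; whether to cache or to decode on every access is a design choice the sketch does not try to settle.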