Date: Wed, 12 Aug 2015 19:19:45 +0000 (UTC)
From: "Rohini Palaniswamy (JIRA)"
To: pig-dev@hadoop.apache.org
Subject: [jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694035#comment-14694035 ]

Rohini Palaniswamy commented on PIG-1472:
-----------------------------------------

Thanks [~thejas]. Created PIG-4656 to move to WritableUtils.writeVInt. Where the type also denotes the size, it will be kept as is. For example, TUPLE_0 to TUPLE_9 will stay, since each packs type and size into one byte. But of TINYTUPLE, SMALLTUPLE and TUPLE, only TUPLE will be retained, converted to use WritableUtils.writeVInt.
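The scheme described in the comment can be sketched roughly as follows. This is a minimal illustration, not Pig's actual code: the marker values, the class name and the helper names are hypothetical, and the variable-length integer below is a simple 7-bits-per-byte stand-in that does not reproduce the byte layout of Hadoop's WritableUtils.writeVInt.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; marker values and names are hypothetical, not Pig's.
final class TupleHeaderSketch {
    static final int TUPLE_0 = 100; // TUPLE_0..TUPLE_9 occupy 100..109 here
    static final int TUPLE = 110;   // generic marker, followed by a varint size

    // Stand-in varint: 7 payload bits per byte, high bit set while more bytes follow.
    // Assumption: this is NOT the encoding used by WritableUtils.writeVInt.
    static void writeVarInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Small tuples pack type and size into one byte; larger ones use TUPLE + varint.
    static void writeTupleHeader(DataOutput out, int size) throws IOException {
        if (size < 0) {
            throw new IllegalArgumentException("negative tuple size: " + size);
        }
        if (size <= 9) {
            out.writeByte(TUPLE_0 + size);
        } else {
            out.writeByte(TUPLE);
            writeVarInt(out, size);
        }
    }

    // Convenience wrapper returning the encoded header bytes.
    static byte[] header(int size) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeTupleHeader(new DataOutputStream(buf), size);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(header(3).length);   // 1 byte: type and size together
        System.out.println(header(10).length);  // 2 bytes: TUPLE + 1 varint byte
        System.out.println(header(200).length); // 3 bytes: TUPLE + 2 varint bytes
    }
}
```

With a single generic marker plus a variable-length size, separate tiny/small/regular markers become unnecessary, while the one-byte forms for sizes 0 to 9 remain the most compact option.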
> Optimize serialization/deserialization between Map and Reduce and between MR jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs.
> For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).
> There are a few optimizations that have been shown to improve the performance of sedes in my tests:
> 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a single byte can be used to store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half.
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
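The two optimizations listed in the quoted description can be sketched with plain java.io as follows. This is an illustration under stated assumptions, not the code from the attached patches: the marker values and method names are hypothetical, and the one-byte form is used here for any length that fits in an unsigned byte (up to 255).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; marker values and names are hypothetical, not Pig's.
final class SedesSketch {
    static final int SMALL_BYTEARRAY = 1; // length follows as one unsigned byte
    static final int BYTEARRAY = 2;       // length follows as a 4-byte int

    // Optimization 1: spend one byte on the length when it fits.
    static byte[] encodeBytes(byte[] data) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            if (data.length <= 255) {
                out.writeByte(SMALL_BYTEARRAY);
                out.writeByte(data.length);
            } else {
                out.writeByte(BYTEARRAY);
                out.writeInt(data.length);
            }
            out.write(data);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decodeBytes(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            int marker = in.readUnsignedByte();
            int length;
            if (marker == SMALL_BYTEARRAY) {
                length = in.readUnsignedByte();
            } else if (marker == BYTEARRAY) {
                length = in.readInt();
            } else {
                throw new IOException("unknown marker " + marker);
            }
            byte[] data = new byte[length];
            in.readFully(data);
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimization 2: let DataOutput.writeUTF / DataInput.readUTF handle Strings.
    static byte[] encodeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s); // 2-byte length + modified UTF-8
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decodeString(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeBytes(new byte[10]).length);  // 12
        System.out.println(encodeBytes(new byte[300]).length); // 305
        System.out.println(decodeString(encodeString("pig"))); // pig
    }
}
```

Note that writeUTF stores the encoded length in two bytes, so it rejects strings whose modified UTF-8 form exceeds 65535 bytes; a real implementation would need a separate path for longer strings.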