hadoop-pig-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
Date Fri, 09 Jul 2010 08:34:49 GMT

    [ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886647#action_12886647 ]

Hadoop QA commented on PIG-1472:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12449033/PIG-1472.3.patch
  against trunk revision 960062.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 69 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 395 release audit warnings (more than the
trunk's current 394 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/343/console

This message is automatically generated.

> Optimize serialization/deserialization between Map and Reduce and between MR jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent serializing/deserializing
> (sedes) records between Map and Reduce and between MR jobs.
> For example, if PigMix queries are modified to specify types for all the fields in the
> load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records
> with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime
> increase of several times has been seen).
> There are a few optimizations that have been shown to improve the performance of sedes in
> my tests -
> 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray
> is smaller than 255 bytes, a byte can be used to store the length instead of the integer
> that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF.
> This reduces the cost of serialization by more than half.
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization
format that these loaders use cannot change, so after the optimization their format is going
to be different from the format used between M/R boundaries.
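
The two optimizations in the description can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code in the attached patches: the class name SedesSketch, the tag constants, and the helper methods are hypothetical. It assumes a one-byte tag is written first so the reader knows whether the length that follows is one byte or a four-byte int. Note that DataOutput.writeUTF uses modified UTF-8 with a two-byte length prefix, so it only handles strings whose encoded form is at most 65535 bytes; longer strings would need a fallback path that is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; names are illustrative and not taken from the Pig code base.
final class SedesSketch {
    // Assumed tags that record how wide the length field is.
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    // Optimization 1: use one byte for the length when the value is small enough.
    static void writeBytes(DataOutput out, byte[] data) throws IOException {
        if (data.length <= 0xFF) {
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(data.length); // only the low 8 bits are written
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(data.length);
        }
        out.write(data);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte tag = in.readByte();
        int length;
        if (tag == TINY_BYTEARRAY) {
            length = in.readUnsignedByte();
        } else if (tag == BYTEARRAY) {
            length = in.readInt();
        } else {
            throw new IOException("Unknown tag: " + tag);
        }
        byte[] data = new byte[length];
        in.readFully(data);
        return data;
    }

    // Serializes a bytearray followed by a String.
    // Optimization 2: the String goes through DataOutput.writeUTF.
    static byte[] encode(byte[] data, String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeBytes(out, data);
            out.writeUTF(text);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the pair written by encode: element 0 is byte[], element 1 is String.
    static Object[] decode(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            byte[] data = readBytes(in);
            String text = in.readUTF();
            return new Object[] { data, text };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a 10-byte value costs 2 bytes of overhead (tag plus one length byte) rather than 5 (tag plus a four-byte int), which is where the saving comes from when records carry many short fields.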

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

