hadoop-pig-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1198) [zebra] performance improvements
Date Fri, 26 Feb 2010 08:18:28 GMT

    [ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838747#action_12838747 ]

Hadoop QA commented on PIG-1198:

-1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 916429.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/console

This message is automatically generated.

> [zebra] performance improvements
> --------------------------------
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>         Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch
> Currently, input split generation is a row-based split on individual TFiles. An undesirable
> consequence is that one split is still generated for each TFile, even for TFiles smaller than
> one block. As a result, many mappers, and many waves, are needed to handle the many small
> TFiles generated by as many mappers/reducers that wrote the data. This issue can be addressed
> by generating input splits that can include multiple TFiles.
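
The idea of letting one input split cover several small TFiles can be sketched as a greedy packing of file lengths up to a goal size. This is only an illustration under stated assumptions, not the code in the attached patch: `SmallFilePacker`, `pack`, and the `goalSize` parameter are hypothetical names, and files are represented by their lengths in bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Zebra or the patch.
final class SmallFilePacker {
    private SmallFilePacker() {}

    /**
     * Groups files, given by length in bytes and taken in input order, into
     * splits whose total size stays at or below goalSize where possible.
     * A file larger than goalSize ends up in a split of its own.
     * Returns, for each split, the indices of the files it contains.
     */
    static List<List<Integer>> pack(long[] fileLengths, long goalSize) {
        if (goalSize <= 0) {
            throw new IllegalArgumentException("goalSize must be positive");
        }
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileLengths.length; i++) {
            long len = fileLengths[i];
            // Close the current split if adding this file would exceed the goal.
            if (!current.isEmpty() && currentBytes + len > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With five files of 10, 20, 30, 100 and 5 bytes and a goal size of 64, this sketch yields three splits instead of five: the first three files share one split, and the remaining two get one each.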
> For sorted tables, key distribution generation by table, which is used to generate proper
> input splits, includes key distributions from column groups even if they are not in the
> projection. This incurs the extra cost of unnecessary computation and, worse, produces
> unreasonable input splits;
> For unsorted tables, when row splits are generated on a union of tables, the FileSplits
> are generated for each table and then lumped together to form the final list of splits to
> Map/Reduce. This has the undesirable effect that the number of splits depends on the number
> of tables in the table union and is not controlled just by the number of splits used by the Map/Reduce
> The input split's goal size is calculated on all column groups even if some of them are
> not in the projection;
> For input splits of multiple files in one column group, all files are opened at startup.
> This is unnecessary and holds resources needlessly from start to end. The files should
> be opened when needed and closed when no longer needed;
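
The last point, opening each file only when it is needed and closing it once it is exhausted, can be sketched as follows. This is a minimal illustration, not the patch's implementation: `RecordSource` and `LazyMultiFileReader` are hypothetical names, records are assumed to be non-null, and real file I/O is replaced by plain iterators to keep the sketch self-contained.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for one file: open() is called only when the file is first read. */
interface RecordSource<T> {
    Iterator<T> open();
    void close();
}

/** Reads several sources in order while holding at most one of them open at a time. */
final class LazyMultiFileReader<T> implements AutoCloseable {
    private final List<RecordSource<T>> sources;
    private int index = -1;      // index of the most recently opened source
    private Iterator<T> current; // null while no source is open

    LazyMultiFileReader(List<RecordSource<T>> sources) {
        this.sources = sources;
    }

    /** Returns the next record, or null once every source is exhausted. */
    T next() {
        while (true) {
            if (current != null && current.hasNext()) {
                return current.next();
            }
            if (current != null) {
                // The open source is exhausted: release it before moving on.
                sources.get(index).close();
                current = null;
            }
            if (index + 1 >= sources.size()) {
                return null;
            }
            index++;
            current = sources.get(index).open();
        }
    }

    /** Closes the source that is still open, if any, and stops further reads. */
    @Override
    public void close() {
        if (current != null) {
            sources.get(index).close();
            current = null;
        }
        index = sources.size();
    }
}
```

Constructing the reader opens nothing. The first call to `next()` opens the first source, and each later source is opened only after the previous one has been closed.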

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
