pig-dev mailing list archives

From "Chao Wang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1198) [zebra] performance improvements
Date Fri, 26 Feb 2010 01:00:32 GMT

     [ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1198:

Patch reviewed.

Some feedback:

1) In the fillRowSplit() method, reader.close() should always be called at the end, on every exit path;

2) In mapreduce.TableInputFormat.getRowSplits(), the batchSize variable is not needed.

Patch looks good overall. +1
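
On point 1 above, a minimal sketch of closing a reader on every exit path with try/finally. FakeReader and fillRowSplitSketch are hypothetical stand-ins invented for illustration; they are not the Zebra classes or the code in the patch.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseAlwaysSketch {
    /** Hypothetical stand-in for the reader used in fillRowSplit(). */
    static class FakeReader implements Closeable {
        boolean closed = false;

        /** Pretends to scan rows; fails on demand to exercise the error path. */
        long scan(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated read failure");
            }
            return 42L;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Does the work and closes the reader whether or not scan() throws. */
    static long fillRowSplitSketch(FakeReader reader, boolean fail) throws IOException {
        try {
            return reader.scan(fail);
        } finally {
            reader.close(); // runs on normal return and on exception
        }
    }

    public static void main(String[] args) {
        FakeReader reader = new FakeReader();
        try {
            fillRowSplitSketch(reader, true);
        } catch (IOException e) {
            System.out.println("scan failed, reader closed: " + reader.closed);
        }
    }
}
```

The finally block is the point of the sketch: the reader is released even when the scan throws, which an unconditional close() placed after the work would not guarantee.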

> [zebra] performance improvements
> --------------------------------
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>         Attachments: PIG-1198.patch, PIG-1198.patch
> Current input split generation is a row-based split on individual TFiles. This has the undesired
effect that even for TFiles smaller than one block, one split is still generated for each. Consequently,
many mappers, and many waves, are needed to handle the many small TFiles generated
by the equally many mappers/reducers that wrote the data. This issue can be addressed by generating
input splits that can include multiple TFiles.
> For sorted tables, the per-table key distribution generation, which is used to generate proper
input splits, includes key distributions from column groups even if they are not in the projection.
This incurs the extra cost of unnecessary computation and, worse, produces
unreasonable results in input split generation;
> For unsorted tables, when row splits are generated on a union of tables, the FileSplits
are generated for each table and then lumped together to form the final list of splits passed to
Map/Reduce. This has the undesirable effect that the number of splits depends on the number of
tables in the table union and is not controlled solely by the number of splits used by the Map/Reduce
> The input split's goal size is calculated over all column groups, even if some of them are
not in the projection;
> For input splits of multiple files in one column group, all files are opened at startup.
This is unnecessary and holds resources needlessly from start to end. The files should
be opened when needed and closed when no longer needed.
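
The first item in the description, letting one input split cover several small TFiles, can be illustrated with a simple greedy grouping. This is only a sketch under stated assumptions: the class and method names are hypothetical, files are represented by their sizes alone, and a file larger than the goal size simply gets a split of its own rather than being subdivided. It is not the algorithm used in the PIG-1198 patch.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiFileSplitSketch {
    /**
     * Groups file indices into splits, in order, starting a new split
     * whenever adding the next file would push the current one past goalSize.
     */
    static List<List<Integer>> pack(long[] fileSizes, long goalSize) {
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            if (!current.isEmpty() && currentBytes + fileSizes[i] > goalSize) {
                splits.add(current); // close the split that is full enough
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += fileSizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current); // flush the trailing split
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical file sizes (arbitrary units) and goal size.
        long[] sizes = {10, 20, 30, 50, 5, 200, 15};
        System.out.println(pack(sizes, 64));
    }
}
```

With these made-up sizes and a goal of 64, the seven files are grouped into four splits, [[0, 1, 2], [3, 4], [5], [6]], instead of one split per file.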

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
