hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-108) PigCombine does not use configure method and therefore de-serialize and instantiate objects with every reduce call
Date Sat, 29 Mar 2008 15:56:25 GMT

    [ https://issues.apache.org/jira/browse/PIG-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583334#action_12583334 ]

Alan Gates commented on PIG-108:
--------------------------------

Perhaps my test wasn't large enough to show the difference.  The query I ran was:

a = load '/user/pig/tests/data/singlefile/studenttab20m';
b = group a by $0;
c = foreach b generate group, COUNT($1);
dump c;

As the name suggests, the file contains 20 million records.  There are 676 distinct groups in
$0.  I ran it on an 8-machine cluster.  The average time was 4m50s without your changes and
4m48s with them.

Your change is going to do better as the number of groups increases, so tests with larger
numbers of distinct groups might show a larger performance differential.

Did you do some performance profiling that suggested that this was a bottleneck?

> PigCombine does not use configure method and therefore de-serialize and instantiate objects with every reduce call
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-108
>                 URL: https://issues.apache.org/jira/browse/PIG-108
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.1.0
>            Reporter: Stefan Groschupf
>            Priority: Critical
>             Fix For: 0.1.0
>
>         Attachments: PIG-108-r639015-v1.patch
>
>
> There is significant room for improvement in PigCombine.
> In each reduce call some objects are deserialized from the jobConf, and the object
> graph is rebuilt again and again.
> Hadoop guarantees to call the configure method before a run, so things like inputCount
> can then be cached as fields.
> During reduce calls the jobConf will not change, so re-deserializing and re-instantiating
> all these objects (pigContext, evalPipe, inputCount, oc, finalout, esp and so on) makes
> no sense from my point of view.
> I am not sure how often PigCombine is used, but fixing this will significantly improve
> performance.
> Was there any reason to do things like this, or is it just historical?
> As soon as the test suite is running again, I would be happy to work on a patch if there
> are no other opinions about that.
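
The pattern described in the issue can be illustrated with a small, self-contained sketch. It is not the PigCombine code or the attached patch, and it avoids the Hadoop API so that it runs on its own. FakeJobConf, CombinerState, PerCallCombiner, CachingCombiner and the key "pig.inputCount" are all hypothetical names made up for the illustration. The sketch only assumes what the issue states: configure is called once on an instance before its reduce calls, and the configuration does not change between them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for a job configuration: a plain string map.
class FakeJobConf {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key) { return values.get(key); }
}

// Hypothetical stand-in for the state rebuilt from the configuration.
class CombinerState {
    final int inputCount;
    private CombinerState(int inputCount) { this.inputCount = inputCount; }

    // Represents the expensive step: parsing and building objects from the conf.
    static CombinerState fromConf(FakeJobConf conf) {
        return new CombinerState(Integer.parseInt(conf.get("pig.inputCount")));
    }
}

// Rebuilds the state on every reduce call, as the issue describes.
class PerCallCombiner {
    private FakeJobConf conf;
    int builds = 0; // how many times the state was rebuilt

    void configure(FakeJobConf conf) { this.conf = conf; }

    long reduce(String key, Iterator<Long> partialCounts) {
        CombinerState state = CombinerState.fromConf(conf); // repeated per key
        builds++;
        long total = 0;
        while (partialCounts.hasNext()) total += partialCounts.next();
        return total;
    }
}

// Builds the state once in configure and keeps it in a field.
class CachingCombiner {
    private CombinerState state; // cached for all later reduce calls
    int builds = 0;

    void configure(FakeJobConf conf) {
        state = CombinerState.fromConf(conf);
        builds++;
    }

    long reduce(String key, Iterator<Long> partialCounts) {
        // Real code would consult the cached state here; the sum is unaffected.
        long total = 0;
        while (partialCounts.hasNext()) total += partialCounts.next();
        return total;
    }
}

class CombinerDemo {
    public static void main(String[] args) {
        FakeJobConf conf = new FakeJobConf();
        conf.set("pig.inputCount", "1");

        PerCallCombiner perCall = new PerCallCombiner();
        CachingCombiner caching = new CachingCombiner();
        perCall.configure(conf);
        caching.configure(conf);

        // One reduce call per distinct group, using the 676 groups from the test above.
        for (int group = 0; group < 676; group++) {
            perCall.reduce("g" + group, Arrays.asList(1L, 2L).iterator());
            caching.reduce("g" + group, Arrays.asList(1L, 2L).iterator());
        }
        System.out.println("per-call builds: " + perCall.builds);
        System.out.println("cached builds:   " + caching.builds);
    }
}
```

In this sketch the number of rebuilds in the per-call variant grows with the number of reduce calls, that is, with the number of distinct groups seen by the combiner, while the cached variant rebuilds once per configured instance. This matches the expectation above that the difference should grow with the number of groups. How much wall-clock time that saves in a real job depends on how expensive the rebuild is, which the sketch does not measure.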

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

