hadoop-pig-dev mailing list archives

From "Stefan Groschupf (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-108) PigCombine does not use configure method and therefore de-serialize and instantiate objects with every reduce call
Date Sat, 29 Mar 2008 12:22:45 GMT

    [ https://issues.apache.org/jira/browse/PIG-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583315#action_12583315 ]

Stefan Groschupf commented on PIG-108:

That surprises me, since de-serializing an object from a String on every reduce call should
have a significant performance impact.
The pig context was de-serialized from a String on every reduce call, because this was done
before the null check.
The old code looked like this:

    public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
            throws IOException {

        try {
            PigContext pigContext = (PigContext) ObjectSerializer.deserialize(job.get("pig.pigContext"));
            if (evalPipe == null) {

Thanks, Alan, for checking this in. I will try to spend some more time over the next few weeks
profiling Pig a little more.
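
To illustrate the difference, here is a minimal, self-contained sketch. It is not the actual PigCombine code and not the attached patch. A plain Map stands in for the JobConf, `expensiveDeserialize` is a hypothetical stand-in for `ObjectSerializer.deserialize`, and the class and method names are made up for this example. Only the "pig.pigContext" key is taken from the old code above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ConfigureCachingSketch {

    // Counts how often the simulated deserialization runs.
    static int deserializeCalls = 0;

    // Hypothetical stand-in for ObjectSerializer.deserialize(String).
    static Object expensiveDeserialize(String serialized) {
        deserializeCalls++;
        return "context:" + serialized;
    }

    // Old pattern: the context is deserialized inside every reduce call.
    static class PerCallCombiner {
        private final Map<String, String> job;

        PerCallCombiner(Map<String, String> job) {
            this.job = job;
        }

        int reduce(String key, Iterator<Integer> values) {
            // Repeated work: runs once per reduce call.
            Object ctx = expensiveDeserialize(job.get("pig.pigContext"));
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            return sum;
        }
    }

    // Proposed pattern: deserialize once in configure and keep the result as a field.
    static class CachingCombiner {
        private Object ctx;

        void configure(Map<String, String> job) {
            ctx = expensiveDeserialize(job.get("pig.pigContext"));
        }

        int reduce(String key, Iterator<Integer> values) {
            if (ctx == null) {
                throw new IllegalStateException("configure was not called");
            }
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            return sum;
        }
    }

    // Assumed sample configuration; the value is a placeholder string.
    static Map<String, String> sampleJob() {
        Map<String, String> job = new HashMap<>();
        job.put("pig.pigContext", "serialized-context");
        return job;
    }

    // Returns the number of deserializations for the given number of reduce calls.
    static int countPerCall(int calls) {
        deserializeCalls = 0;
        PerCallCombiner combiner = new PerCallCombiner(sampleJob());
        for (int i = 0; i < calls; i++) {
            combiner.reduce("k", Arrays.asList(1, 2, 3).iterator());
        }
        return deserializeCalls;
    }

    static int countCached(int calls) {
        deserializeCalls = 0;
        CachingCombiner combiner = new CachingCombiner();
        combiner.configure(sampleJob());
        for (int i = 0; i < calls; i++) {
            combiner.reduce("k", Arrays.asList(1, 2, 3).iterator());
        }
        return deserializeCalls;
    }

    public static void main(String[] args) {
        System.out.println("per-call deserializations: " + countPerCall(3));
        System.out.println("cached deserializations:   " + countCached(3));
    }
}
```

In this sketch the per-call variant deserializes once per reduce call, while the cached variant deserializes once in total, no matter how many reduce calls follow.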

> PigCombine does not use configure method and therefore de-serialize and instantiate objects with every reduce call
> ------------------------------------------------------------------------------------------------------------------
>                 Key: PIG-108
>                 URL: https://issues.apache.org/jira/browse/PIG-108
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.1.0
>            Reporter: Stefan Groschupf
>            Priority: Critical
>             Fix For: 0.1.0
>         Attachments: PIG-108-r639015-v1.patch
> There is some significant room for improvement in PigCombine.
> In each reduce call some objects are deserialized from the jobConf, and the object
> graph is rebuilt again and again.
> Hadoop guarantees that the configure method is called before a run, so things like
> inputCount can then be cached as fields.
> During reduce calls the jobConf will not change, so re-deserializing and re-instantiating
> all these objects (pigContext, evalPipe, inputCount, oc, finalout, esp and so on)
> makes no sense from my point of view.
> Not sure how often PigCombine is used, but fixing this will significantly improve
> performance.
> Was there any reason to do things like this, or is that just historical?
> As soon as the test suite is running again, I would be happy to work on a patch if there
> are no other opinions on that.

