pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-1932) GFCross should allow the user to set the DEFAULT_PARALLELISM value
Date Thu, 24 Mar 2011 17:49:05 GMT

     [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1932:
----------------------------

    Attachment: PIG-1932.patch

Unit tests pass.  Results of test-patch:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 545 release audit warnings (more than the trunk's current 544 warnings).
     [exec]

The one new release audit warning (545 versus trunk's 544) is because the patch adds a file.

> GFCross should allow the user to set the DEFAULT_PARALLELISM value
> ------------------------------------------------------------------
>
>                 Key: PIG-1932
>                 URL: https://issues.apache.org/jira/browse/PIG-1932
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Alan Gates
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: PIG-1932.patch
>
>
> The internal UDF GFCross uses a final static int DEFAULT_PARALLELISM to determine how
> wide to spread the records in a cross.  It is currently hard-wired to 96.  There are no
> comments in the code on how that value was settled on.  Despite the name, this value is
> not necessarily related to the reduce parallelism controlled by the parallel clause.  It
> controls how many artificial join key values are generated and how many times each record
> is duplicated before going through the join.  The higher it is set, the more key values
> there are (and thus the less likely the cross is to run out of memory), but also the more
> times each record is duplicated in the map phase before being sent to the reduce.
> We should leave the default value at 96 but allow a property to override this default
> and change the value.
> We cannot use a constructor argument here because the use of the UDF is not exposed to
> the user, so the user has no opportunity to pass a constructor argument to it.
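
To make the trade-off in the description concrete, here is a minimal Java sketch. It is not the GFCross source and not the attached patch. It assumes a spreading scheme in which each of the n inputs gets g groups, where g is the smallest integer with g^n >= parallelism, and each record is copied to g^(n-1) artificial keys. The class name CrossSpreadSketch and the property name example.cross.parallelism are hypothetical and chosen only for illustration; only the default of 96 comes from the issue.

```java
import java.util.Properties;

// Illustrative sketch only: names and the spreading scheme are assumptions,
// not the actual GFCross implementation.
public class CrossSpreadSketch {
    // Default taken from the issue description.
    public static final int DEFAULT_PARALLELISM = 96;

    // Hypothetical property name; the issue does not name the real one.
    public static final String PARALLELISM_PROPERTY = "example.cross.parallelism";

    // Use the property when it holds a positive integer, else keep the default.
    public static int resolveParallelism(Properties props) {
        String value = props.getProperty(PARALLELISM_PROPERTY);
        if (value == null) {
            return DEFAULT_PARALLELISM;
        }
        try {
            int parsed = Integer.parseInt(value.trim());
            return parsed > 0 ? parsed : DEFAULT_PARALLELISM;
        } catch (NumberFormatException e) {
            return DEFAULT_PARALLELISM;
        }
    }

    // Integer power, used to avoid floating-point rounding at exact roots.
    private static long power(int base, int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    // Smallest g such that g^numInputs >= parallelism (assumed scheme).
    public static int groupsPerInput(int parallelism, int numInputs) {
        int g = 1;
        while (power(g, numInputs) < parallelism) {
            g++;
        }
        return g;
    }

    // Number of artificial keys each record is copied to: g^(numInputs - 1).
    public static long copiesPerRecord(int parallelism, int numInputs) {
        return power(groupsPerInput(parallelism, numInputs), numInputs - 1);
    }

    // Total number of distinct artificial join keys: g^numInputs.
    public static long totalKeys(int parallelism, int numInputs) {
        return power(groupsPerInput(parallelism, numInputs), numInputs);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        int parallelism = resolveParallelism(props);
        System.out.println("parallelism       = " + parallelism);
        System.out.println("groups per input  = " + groupsPerInput(parallelism, 2));
        System.out.println("copies per record = " + copiesPerRecord(parallelism, 2));
        System.out.println("total join keys   = " + totalKeys(parallelism, 2));
    }
}
```

Under these assumptions, a two-way cross at the default of 96 yields 10 groups per input, so each record is duplicated 10 times across 100 key values. Raising the value to 400 doubles both the groups per input and the copies per record, which is the memory-versus-duplication trade-off the description refers to.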

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
