hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-958) Splitting output data on key field
Date Thu, 24 Sep 2009 00:49:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758952#action_12758952
] 

Pradeep Kamath commented on PIG-958:
------------------------------------

In general the patch looks good - do address the hudson QA issues. Some code review comments:

1) I don't think pig.properties changes should be part of this patch - was this accidental?
Likewise for Main.java.

2)
{noformat}
 72   /**                                                                                
                                                                                         
                                                       
 73    *  storeUID is needed to get the around the case of multi-query optimisation where
                                                                                         
                                                       
 74    *  multiple instances of MultiStorage could be running in a single mapper/reducer 
                                                                                         
                                                       
 75    *  and run the risk of overwriting each other's output.                           
                                                                                         
                                                       
 76    */      
{noformat}
Wouldn't the multiple instances be writing to different locations in which case there should
be no race condition right?

3)
{noformat}
172   private void initJobSpecificParams() throws IOException {                          
                                                                                         
                                                       
173     if (partition == null || outputPath == null) {   
{noformat}
Wouldn't the outputPath never be null since it is initialized in the constructor?

4) Consider removing the log.debug() since these are all in the inner loop and would possibly
impact performance.

5)
{noformat}
291         String fieldValueBasedPathStr = fieldValueBasedPath.toUri().getPath(); 
{noformat}
This variable is only really used in a log.debug(), I think removing this and using fieldValueBasedPath
in all the fs.create() will make the code shorter and cleaner.

6) Some comments on why the move of the output dirs is needed and in the moveResults() and
removePart() methods would be helpful. Additional comments on the code flow would also help


> Splitting output data on key field
> ----------------------------------
>
>                 Key: PIG-958
>                 URL: https://issues.apache.org/jira/browse/PIG-958
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Ankur
>         Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of files and directories
depending on the type of record. Pig's SPLIT operator is useful when record types are few
and known in advance. In cases where type is not directly known but is derived dynamically
from values of a key field in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message