pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PIG-3142) Fixed-width load and store functions for the Piggybank
Date Thu, 28 Feb 2013 02:43:14 GMT

    [ https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589116#comment-13589116
] 

Cheolsoo Park edited comment on PIG-3142 at 2/28/13 2:42 AM:
-------------------------------------------------------------

[~jpacker], thank you very much for the patch! This looks very useful. I have a few minor
comments:
* Can you move the following code to prepareToRead() and remove requiredFieldsInitialized?
The comment in LoadFunc says "This (prepareToRead) will be called during execution before
any calls to getNext", so I think this kind of initialization should be done there not in
getNext().
{code:title=FixedWidthLoader.java}
+        if (!requiredFieldsInitialized) {
+            UDFContext udfc = UDFContext.getUDFContext();
+            Properties p = udfc.getUDFProperties(this.getClass(), new String[] { udfContextSignature
});
+            requiredFields = (boolean[]) ObjectSerializer.deserialize(p.getProperty(REQUIRED_FIELDS_SIGNATURE));
+
+            if (requiredFields != null) {
+                numRequiredFields = 0;
+                for (int i = 0; i < requiredFields.length; i++) {
+                    if (requiredFields[i])
+                        numRequiredFields++;
+                }
+            }
+            
+            requiredFieldsInitialized = true;
+        }
{code}
In fact, I found almost all LoadFunc implementations (including PigStorage) do the same trick.
Looking at the commit history, I believe that this is a legacy pattern that was introduced
before the LoadFunc resign. It seems that we never cleaned up this pattern when merging the
LoadFunc redesign to trunk. At least, let's stop using it from now on. 
* Can you update the usage messages with Apache Pig namespace?
{code:title=FixedWidthLoader.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
Also, the usage message in FixedWidthStorage incorrectly refers to FixedWidthLoader.
{code:title=FixedWidthStorage.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
* Can you factor out parseColumnSpec(String spec) and reuse it in both FixedWidthLoader and
FixedWidthStorage instead have duplicate code?
* Can you rename FixedWidthStorage to FixedWidthStorer? I think Storage refers to Loader +
Storer (e.g. PigStorage, AvroStorage, etc). So FixedWidthStorer is probably a better name
(e.g. HCatStorer).

Please let me know what you think. Thanks!
                
      was (Author: cheolsoo):
    [~jpacker], thank you very much for the patch! This looks very useful. I have a few minor
comments:
* Can you move the following code to prepareToRead() and remove requiredFieldsInitialized?
The comment in LoadFunc says "This will be called during execution before any calls to getNext()",
so I think this kind of initialization should be done there not in getNext().
{code:title=FixedWidthLoader.java}
+        if (!requiredFieldsInitialized) {
+            UDFContext udfc = UDFContext.getUDFContext();
+            Properties p = udfc.getUDFProperties(this.getClass(), new String[] { udfContextSignature
});
+            requiredFields = (boolean[]) ObjectSerializer.deserialize(p.getProperty(REQUIRED_FIELDS_SIGNATURE));
+
+            if (requiredFields != null) {
+                numRequiredFields = 0;
+                for (int i = 0; i < requiredFields.length; i++) {
+                    if (requiredFields[i])
+                        numRequiredFields++;
+                }
+            }
+            
+            requiredFieldsInitialized = true;
+        }
{code}
In fact, I found almost all LoadFunc implementations (including PigStorage) do the same trick.
Looking at the commit history, I believe that this is a legacy pattern that was introduced
before the LoadFunc resign. It seems that we never cleaned up this pattern when merging the
LoadFunc redesign to trunk. At least, let's stop using it from now on. 
* Can you update the usage messages with Apache Pig namespace?
{code:title=FixedWidthLoader.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
Also, the usage message in FixedWidthStorage incorrectly refers to FixedWidthLoader.
{code:title=FixedWidthStorage.java}
"Usage: com.mortardata.pig.FixedWidthLoader(" +
{code}
* Can you factor out parseColumnSpec(String spec) and reuse it in both FixedWidthLoader and
FixedWidthStorage instead have duplicate code?
* Can you rename FixedWidthStorage to FixedWidthStorer? I think Storage refers to Loader +
Storer (e.g. PigStorage, AvroStorage, etc). So FixedWidthStorer is probably a better name
(e.g. HCatStorer).

Please let me know what you think. Thanks!
                  
> Fixed-width load and store functions for the Piggybank
> ------------------------------------------------------
>
>                 Key: PIG-3142
>                 URL: https://issues.apache.org/jira/browse/PIG-3142
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.11
>            Reporter: Jonathan Packer
>         Attachments: fixed-width.patch, fixed-width-updated.patch
>
>
> Adds load/store functions for fixed width data to the Piggybank. They use the syntax
of the unix "cut" command to specify column positions, and have an option to skip the header
row when loading or to write a header row when storing.
> The header handling works properly with multiple small files each with a header being
combined into one split, or a large file with a single header being split into multiple splits.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message