hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front
Date Wed, 17 Feb 2010 00:41:27 GMT

     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1216:
----------------------------------

    Attachment: pig-1216_1.patch

bq. Is it ok to call outputSpecs multiple times [...]
I talked with Arun about this. For a user-supplied OutputFormat, the user also provides the
implementation of checkOutputSpecs(), so it is the user's responsibility to make sure the call
is idempotent. PigStorage uses TextOutputFormat, for which checkOutputSpecs() is idempotent.
We need to document this requirement in the user manual.
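
A minimal sketch of what "idempotent" means for such a check, under stated assumptions: the class and method names below are hypothetical, and java.nio.file stands in for HDFS, so this is not the Hadoop OutputFormat API. A check that only reads file-system state can safely be called more than once; a check that also creates the output location passes once and then fails.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical stand-in for an output check; not Pig or Hadoop API. */
class OutputCheckSketch {

    /** Idempotent: only inspects state, so repeated calls give the same answer. */
    static void checkOutputSpecs(Path output) throws IOException {
        if (Files.exists(output)) {
            throw new IOException("Output location already exists: " + output);
        }
    }

    /** Not idempotent: creates the location as a side effect, so a second call fails. */
    static void checkAndCreate(Path output) throws IOException {
        checkOutputSpecs(output);
        Files.createDirectories(output);
    }

    /** Runs one of the two checks and reports whether it passed. */
    static boolean passes(Path output, boolean withSideEffect) {
        try {
            if (withSideEffect) {
                checkAndCreate(output);
            } else {
                checkOutputSpecs(output);
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Calls each check twice on a fresh location and collects the four results. */
    static boolean[] demo() {
        try {
            Path base = Files.createTempDirectory("output-check-sketch");
            Path a = base.resolve("a");
            Path b = base.resolve("b");
            return new boolean[] {
                passes(a, false), passes(a, false), // idempotent check, called twice
                passes(b, true), passes(b, true)    // side-effecting check, called twice
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // [true, true, true, false]
    }
}
```

If the check is called once during up-front validation and again when the job is submitted, only the first form behaves the same both times.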

bq. the test case for validation failure [...]
Done.

bq. import [...]
Done.

Result of test-patch.sh on the patch:
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

> New load store design does not allow Pig to validate inputs and outputs up front
> --------------------------------------------------------------------------------
>
>                 Key: PIG-1216
>                 URL: https://issues.apache.org/jira/browse/PIG-1216
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>         Attachments: pig-1216.patch, pig-1216_1.patch
>
>
> In Pig 0.6 and before, Pig attempts to verify existence of inputs and non-existence of
> outputs during parsing to avoid run-time failures when inputs don't exist or outputs can't
> be overwritten.  The downside to this was that Pig assumed all inputs and outputs were HDFS
> files, which made implementation harder for non-HDFS based load and store functions.  In the
> load store redesign (PIG-966) this was delegated to InputFormats and OutputFormats to avoid
> this problem and to make use of the checks already being done in those implementations.
> Unfortunately, for Pig Latin scripts that run more than one MR job, this does not work well.
> MR does not do input/output verification on all the jobs at once.  It does them one at a
> time.  So if a Pig Latin script results in 10 MR jobs and the file to store to at the end
> already exists, the first 9 jobs will be run before the 10th job discovers that the whole
> thing was doomed from the beginning.
>
> To avoid this a validate call needs to be added to the new LoadFunc and StoreFunc interfaces.
> Pig needs to pass this method enough information that the load function implementer can
> delegate to InputFormat.getSplits() and the store function implementer to
> OutputFormat.checkOutputSpecs() if s/he decides to.  Since 90% of all load and store
> functions use HDFS and PigStorage will also need to, the Pig team should implement a default
> file existence check on HDFS and make it available as a static method to other Load/Store
> function implementers.
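
The 10-job scenario described above can be sketched as follows. This is an illustration under simplifying assumptions, not the patch: the class and method names are hypothetical, output locations are plain strings, and "already exists" is modelled as membership in a set rather than a real HDFS lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical model of per-job versus up-front output validation. */
class UpFrontValidationSketch {

    /**
     * Checks each output only when its job is about to start, as when the
     * check is left to each MR job. Returns how many jobs ran before a failure.
     */
    static int runCheckingLazily(List<String> outputs, Set<String> existing) {
        int completed = 0;
        for (String out : outputs) {
            if (existing.contains(out)) {
                return completed; // failure found late; earlier jobs were wasted
            }
            completed++; // the job "runs"
        }
        return completed;
    }

    /**
     * Checks every output before starting any job. Returns 0 if any check
     * fails, otherwise the number of jobs run.
     */
    static int runCheckingUpFront(List<String> outputs, Set<String> existing) {
        for (String out : outputs) {
            if (existing.contains(out)) {
                return 0; // fail fast, nothing has run yet
            }
        }
        return outputs.size();
    }

    public static void main(String[] args) {
        List<String> outputs = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            outputs.add("/out/job" + i);
        }
        // Only the final store location already exists.
        Set<String> existing = Collections.singleton("/out/job10");

        System.out.println(runCheckingLazily(outputs, existing));  // 9
        System.out.println(runCheckingUpFront(outputs, existing)); // 0
    }
}
```

With the per-job check, nine jobs run before the failure surfaces; with the up-front check, the script fails before any job is started.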

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

