pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
Date Sat, 24 Apr 2010 23:19:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860614#action_12860614 ]

Ashutosh Chauhan commented on PIG-1211:

Oh, I got confused. From your earlier comment, I understood you to be saying that we should
add a -checkscript command-line option. From your previous comment, are you suggesting that
we should add a syntax checker that always runs (i.e., without needing any command-line directive)
before the query starts to execute, thereby catching as many user errors as possible? I
think this is a reasonable ask and will be useful to users. This might be the first step towards
making the distinction between Pig compile time and run time explicit to the user. If we go the
full length here, we might as well do what Milind suggested earlier (and in a recent mail thread):
add a "compilation" phase which first runs a syntax checker, then generates "object
code" (essentially a job jar) from the Pig script. This compiled object can then be handed over
to the run-time (the Hadoop cluster). Wow, Pig Latin is evolving towards a "true language" :)
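
To make the "check everything first, then run" idea concrete, here is a minimal sketch in Python.
It is not Pig's parser or any existing Pig API: the names check_script and compile_then_run are
hypothetical, and the only rule checked is the shape of 'store' statements, which is enough to
catch the "store ... to" typo from the issue before any statement executes.

```python
import re

# Toy rule (an assumption for illustration): only 'store' statements are checked.
# A real checker would reuse Pig's own grammar instead of a regular expression.
STORE_RE = re.compile(
    r"^store\s+\S+\s+into\s+'[^']*'(\s+using\s+.+)?;$", re.IGNORECASE
)


def is_statement(line):
    """True for lines that are neither blank nor '--' comments."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("--")


def check_script(lines):
    """Return (line_number, message) for every malformed 'store' statement."""
    errors = []
    for number, raw in enumerate(lines, start=1):
        stmt = raw.strip()
        if not is_statement(stmt):
            continue
        if re.match(r"store\s", stmt, re.IGNORECASE) and not STORE_RE.match(stmt):
            errors.append((number, "malformed store statement: " + stmt))
    return errors


def compile_then_run(lines, execute):
    """Check the whole script up front; call execute(stmt) only if it is clean."""
    errors = check_script(lines)
    if errors:
        # Nothing has been executed yet, so no partial output is left behind.
        raise SyntaxError("; ".join(f"line {n}: {msg}" for n, msg in errors))
    for raw in lines:
        if is_statement(raw):
            execute(raw.strip())
```

With this structure, a script whose last statement is malformed fails immediately instead of
after the first store has already run.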

> Pig script runs half way after which it reports syntax error
> ------------------------------------------------------------
>                 Key: PIG-1211
>                 URL: https://issues.apache.org/jira/browse/PIG-1211
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.8.0
> I have a Pig script which is structured in the following way
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4,
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a =  second_stream.col2
>  b =   distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option. It runs successfully until the first store
> but later fails with a syntax error.
> The use of the HDFS command "rmf" causes the first store to execute.
> The only option that I have is to run an explain before running this script
> grunt> explain -script myscript.pig -out explain.out
> or moving the rmf statements to the top of the script
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of explain to get
> the same syntax error? That way I can ensure that I do not run for 3-4 hours before encountering
> a syntax error.
> b) Can Pig not figure out a way to re-order the rmf statements, since all the store directories
> are variables?
> Thanks
> Viraj
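
The manual workaround mentioned in the issue, moving the rmf statements to the top of the script,
could be automated by a small preprocessing step. The sketch below is only an illustration in
Python, not something Pig provides: hoist_rmf is a hypothetical helper, it treats every line as one
statement, and it assumes no earlier statement reads from a path that a later rmf removes.

```python
import re


def is_rmf(line):
    """True if the line is an 'rmf' statement (case-insensitive)."""
    return re.match(r"rmf\s", line.strip(), re.IGNORECASE) is not None


def hoist_rmf(lines):
    """Return the script with all 'rmf' lines first, other lines in original order.

    Assumption: hoisting is safe, i.e. nothing loads from a path before the
    rmf that deletes it. This sketch does not verify that.
    """
    rmf_lines = [line for line in lines if is_rmf(line)]
    other_lines = [line for line in lines if not is_rmf(line)]
    return rmf_lines + other_lines
```

Because all rmf statements then run before the first store, they no longer split the script into
separately executed parts in this simplified model.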

