hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization
Date Mon, 26 Oct 2009 21:56:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770227#action_12770227
] 

Richard Ding commented on PIG-978:
----------------------------------

In the Pig Latin Manual, this is called "Implicit Dependencies in Multi-Query Execution":

{quote}
*Implicit Dependencies*

If a script has dependencies on the execution order outside of what Pig knows about, execution
may fail. For instance, in this script MYUDF might try to read from out1, a file that A was
just stored into. However, Pig does not know that MYUDF depends on the out1 file and might
submit the jobs producing the out2 and out1 files at the same time. To make the script work
(to ensure that the right execution order is enforced) add the exec statement. The exec statement
will trigger the execution of the statements that produce the out1 file.
{quote}

The Pig script in this Jira shows another form of those "implicit dependencies" in multi-query
scripts. Namely, the store and load operators use different file paths, but the load operator
actually depends on the store operator. An exec statement should be inserted between the store
and load statements to ensure that the right execution order is enforced.
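
As a rough sketch of where the exec statement would go, the first half of the reported script might be rearranged as shown here. The field names and types (key, val, total) and the SUM aggregation are assumptions made only so the example is complete; the original script elides them. The paths are the ones from the report.

{code}
-- Sketch only: schema and aggregation are assumed, not taken from the original script.
A = load '/user/viraj/firstinput' using PigStorage() as (key:chararray, val:long);
B = group A by key;
C = foreach B generate group, SUM(A.val);
store C into '/user/viraj/firstinputtempresult/days1';

-- Run everything above before Pig plans the statements below.
exec;

-- This load reads the parent directory of the store target, so Pig cannot
-- infer the dependency from the paths alone.
E = load '/user/viraj/firstinputtempresult/' using PigStorage() as (key:chararray, total:long);
F = group E by key;
G = foreach F generate group, SUM(E.total);
store G into '/user/viraj/finalresult1';
{code}

The same placement would apply to the second pipeline (Atab through Gtab): a single exec after both intermediate stores should be enough to order them ahead of both loads.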

> ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR
2999: (Unexpected internal error. null) when using Multi-Query optimization
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-978
>                 URL: https://issues.apache.org/jira/browse/PIG-978
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.6.0
>
>
> I have a Pig script of this form, which I execute using multi-query optimization.
> {code}
> A = load '/user/viraj/firstinput' using PigStorage();
> B = group ....
> C = ..aggregation function
> store C into '/user/viraj/firstinputtempresult/days1';
> ..
> Atab = load '/user/viraj/secondinput' using PigStorage();
> Btab = group ....
> Ctab = ..aggregation function
> store Ctab into '/user/viraj/secondinputtempresult/days1';
> ..
> E = load '/user/viraj/firstinputtempresult/' using PigStorage();
> F = group 
> G = aggregation function
> store G into '/user/viraj/finalresult1';
> Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
> Ftab = group 
> Gtab = aggregation function
> store Gtab into '/user/viraj/finalresult2';
> {code}
> 2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2100:
hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log)
 
> is due to the mismatch of store/load commands. The script first stores files into the
'days1' directory (store C into '/user/viraj/firstinputtempresult/days1' using PigStorage();),
but it later loads from the top level directory (E = load '/user/viraj/firstinputtempresult/'
using PigStorage()) instead of the original directory (/user/viraj/firstinputtempresult/days1).
> The current multi-query optimizer can't detect the dependency between these two commands because
the store and the load use different file paths. So the jobs run concurrently, which results in the errors.
> The solution is to add an 'exec' or 'run' command after the first two stores. This will
force the first two store commands to run before the remaining commands.
> It would be nice to see this fixed as part of an enhancement to multi-query. We could
either disable multi-query in this case or throw a warning/error message, so that the user can correct
the load/store statements.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

