hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-157) Add types and rework execution pipeline
Date Wed, 07 May 2008 22:15:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595061#action_12595061 ]

Shravan Matthur Narayanamurthy commented on PIG-157:

Thanks for the comments Pi.

1) First concern is that using Hadoop Local will tie us to Hadoop too much.
There was an initiative quite a while ago to start looking at different backends other than
Hadoop (e.g. we might be running a backend like SETI@home. Who knows?).

However, this whole thing seems to have been built solely for Hadoop anyway. Not sure about
the current direction.
[shrav] I don't think this ties us down to Hadoop in the sense that we can't have other backends.
We just reuse some Hadoop code, that's all. The only tie to Hadoop I see is that, at most,
we would need to ship the Hadoop jar with Pig, which we already do.

2) Have you tried to measure LocalHadoop startup time compared to the local engine? If
LocalHadoop takes much more time to start up, we might suffer when processing nested queries.
[shrav] LocalHadoop has a startup time of about 6 sec. But even when processing only
about 10 MB of data, LocalHadoop mysteriously beats the local engine hands down. For the
local engine, I assumed that it would just take the leaf operator, which will be a POStore,
and call its store() method.
For about 12 MB of data, LocalHadoop took about 11 sec, whereas the local engine took about
15 sec.

As far as the nested plan in foreach goes, at least currently, we won't be creating an instance
of a local engine to run the nested plan. Currently, all operators that can be used inside
the nested plan have been implemented such that the generic plan execution model, with attachInputs
called on the inner plan, will work fine. However, if we decide to allow all operators inside
the nested plan, then we will have to make changes to the MRCompiler so that the nested foreach
becomes a blocking operator, handled separately by spawning new MR jobs to process
the plan inside. In this case, invoking LocalHadoop would probably not make sense. The executable
operator plan is a better option here, as it would also mean that no changes to the
MRCompiler are needed.
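
To illustrate the "generic plan execution model" mentioned above, here is a minimal sketch of how an inner plan could be driven by attaching one tuple at a time to its root and pulling results from its leaf. All names here (Op, FilterOp, ProjectOp, NestedPlanSketch, attachInput, getNext, runInnerPlan) are made up for illustration and are not the actual classes on the types branch; tuples are modelled as plain lists, and the filter and projection are stand-ins for whatever operators the nested plan contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical pull-based operator, not Pig's real operator class.
abstract class Op {
    protected Op input;              // upstream operator; null for the plan root
    protected List<Object> attached; // tuple handed in through attachInput()

    void attachInput(List<Object> tuple) { attached = tuple; }

    // Returns the next output tuple, or null if the current input yields nothing.
    abstract List<Object> getNext();

    // The root consumes the attached tuple; other operators pull from upstream.
    protected List<Object> pull() {
        if (input != null) return input.getNext();
        List<Object> t = attached;
        attached = null;
        return t;
    }
}

// Passes a tuple through only if it satisfies the condition.
class FilterOp extends Op {
    private final Predicate<List<Object>> cond;
    FilterOp(Op input, Predicate<List<Object>> cond) { this.input = input; this.cond = cond; }
    List<Object> getNext() {
        List<Object> t = pull();
        return (t != null && cond.test(t)) ? t : null;
    }
}

// Keeps a single column of each tuple.
class ProjectOp extends Op {
    private final int column;
    ProjectOp(Op input, int column) { this.input = input; this.column = column; }
    List<Object> getNext() {
        List<Object> t = pull();
        if (t == null) return null;
        List<Object> out = new ArrayList<>();
        out.add(t.get(column));
        return out;
    }
}

class NestedPlanSketch {
    static List<Object> tuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Runs the inner plan in-process, once per tuple of the bag:
    // no local engine instance and no extra MR job is created.
    static List<List<Object>> runInnerPlan(Op root, Op leaf, List<List<Object>> bag) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> t : bag) {
            root.attachInput(t);
            List<Object> result = leaf.getNext();
            if (result != null) out.add(result);
        }
        return out;
    }

    public static void main(String[] args) {
        Op filter = new FilterOp(null, t -> (Integer) t.get(1) > 10);
        Op project = new ProjectOp(filter, 0);
        List<List<Object>> bag = new ArrayList<>();
        bag.add(tuple("a", 5));
        bag.add(tuple("b", 20));
        bag.add(tuple("c", 30));
        System.out.println(runInnerPlan(filter, project, bag));
    }
}
```

Because the inner plan is just a chain of operator objects called from the enclosing operator, it runs inside the same map or reduce task. A blocking nested foreach would not fit this pattern, which is why it would need MRCompiler changes.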

So, at least now, LocalJobRunner will not be invoked inside the MapReduce execution for executing
nested plans. The LocalJobRunner will be strictly used only when the user is in local execution

I will update the wiki with these comments.
Thanks for the input, Pi. I had not thought about the nested foreach once it becomes full-blown.

> Add types and rework execution pipeline
> ---------------------------------------
>                 Key: PIG-157
>                 URL: https://issues.apache.org/jira/browse/PIG-157
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: Core.patch.zip, exceptions.patch, incr1.zip
> This is the tracking bug for the work to add types to pig and rework the execution pipeline.
> Individual components of this work are covered in subtasks.
> Functional and design specs for this work are:
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> http://wiki.apache.org/pig/PigTypesDesign
> http://wiki.apache.org/pig/PigExecutionModel
> This work is being done on the branch types, since it is large and disruptive, and we
> want to be able to do incremental checkins without causing issues for the trunk.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
