hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode
Date Mon, 26 Oct 2009 22:18:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770237#action_12770237 ]

Alan Gates commented on PIG-1053:
---------------------------------

Pig currently has its own backend framework, referred to as local mode, for executing Pig
Latin scripts on a single box (as opposed to on a Hadoop cluster).
Having a separate implementation has several drawbacks:

1) It does not offer the same functionality as Hadoop.  A number of things do not work, such
as counters, slicers, etc.
2) UDFs (both eval and load/store functions) are often forced to understand both contexts,
and to test whether they are running in local or Hadoop mode.
3) Additional code maintenance, as Pig is forced to maintain its own framework.  Going forward,
as Pig attempts to leverage more MapReduce functionality (see for example PIG-966), maintaining
this separate mode is becoming a larger and larger effort.
4) It makes debugging harder for users and UDF writers, as the execution environment on a
local box differs from that on the production cluster.
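
To make drawback 2 concrete, here is a minimal sketch of the kind of mode check a UDF ends
up carrying.  Everything in it is hypothetical: ExecMode, resolveLocation and the path are
illustrative stand-ins, not Pig or Hadoop APIs, and the URI prefixes are only one way the
two contexts might differ.

```java
// Hypothetical sketch: none of these names are Pig or Hadoop APIs.
public class DualModeLoaderSketch {

    /** Stand-in for the two execution contexts a UDF may find itself in. */
    enum ExecMode { LOCAL, HADOOP }

    /** Resolves an input location differently depending on the backend. */
    static String resolveLocation(ExecMode mode, String path) {
        switch (mode) {
            case LOCAL:
                // Assumed behaviour: the local framework reads the local file system.
                return "file://" + path;
            case HADOOP:
                // Assumed behaviour: the cluster reads the distributed file system.
                return "hdfs://" + path;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // The same UDF logic has to branch on the mode before it can do any work.
        System.out.println(resolveLocation(ExecMode.LOCAL, "/data/in.txt"));
        System.out.println(resolveLocation(ExecMode.HADOOP, "/data/in.txt"));
    }
}
```

With a single backend, the branch and the second code path to test would both go away.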

Pig's local mode has one significant advantage over Hadoop's local mode: it is much faster,
about 15 times faster.  Hadoop is designed for large data sets and thus is not optimized to
handle the startup and teardown involved in small data jobs.

For debugging of code, this performance factor should not be that big an issue.  Where the
performance becomes prohibitive is functionality like ILLUSTRATE.  Taking 30 seconds to give
a sample of data running through your script is excessive compared to 2 seconds.

So, which of these pain points is worse?  Originally we felt the performance was more important.
But as we see many user complaints about the drawbacks listed above and relatively few users
using local mode in performance-intensive ways, we are wondering whether we made the right choice.
Please give your feedback one way or another.


> Consider moving to Hadoop for local mode
> ----------------------------------------
>
>                 Key: PIG-1053
>                 URL: https://issues.apache.org/jira/browse/PIG-1053
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
>
> We need to consider moving Pig to use Hadoop's local mode instead of its own.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

