hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
Date Tue, 21 Sep 2010 18:32:35 GMT

    [ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913154#action_12913154 ]

Alan Gates commented on PIG-1337:
---------------------------------

The problem with allowing load and store functions access to the config file is that the config
file they see is not the config file that goes to Hadoop.  This is not all Pig's fault (see
comments above on this).  The other problem is that multiple instances of the same load and
store function may be operating in a given script, so there are namespace issues to resolve.
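
To illustrate the namespace issue, here is a minimal sketch of keeping properties
separate for each instance of a load function. It is not Pig's actual API:
ScopedUdfProperties, DemoLoader, the "cache.file" key, and the signature strings are
all hypothetical names, and the sketch assumes each instance is handed a unique
signature by the front end.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a per-job property store; not Pig's real API.
class ScopedUdfProperties {
    private final Map<String, Properties> byKey = new HashMap<>();

    // Key on class name plus a per-instance signature, so two instances of
    // the same loader in one script do not overwrite each other's settings.
    Properties get(Class<?> udfClass, String signature) {
        String key = udfClass.getName() + "#" + signature;
        return byKey.computeIfAbsent(key, k -> new Properties());
    }
}

// Hypothetical loader used only for illustration.
class DemoLoader {
    private final String signature;

    DemoLoader(String signature) {
        this.signature = signature;
    }

    void rememberCacheFile(ScopedUdfProperties store, String path) {
        store.get(DemoLoader.class, signature).setProperty("cache.file", path);
    }

    String cacheFile(ScopedUdfProperties store) {
        return store.get(DemoLoader.class, signature).getProperty("cache.file");
    }
}

class ScopedUdfPropertiesDemo {
    public static void main(String[] args) {
        ScopedUdfProperties store = new ScopedUdfProperties();
        DemoLoader a = new DemoLoader("load-1");
        DemoLoader b = new DemoLoader("load-2");
        a.rememberCacheFile(store, "/tables/a/index");
        b.rememberCacheFile(store, "/tables/b/index");
        // Each instance reads back only what it stored under its own signature.
        System.out.println(a.cacheFile(store));
        System.out.println(b.cacheFile(store));
    }
}
```

Without the signature in the key, the second instance would overwrite the first
instance's setting, which is the collision described above.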

The proposal for Hadoop 0.22 is that, rather than providing access to the config file at all,
Hadoop will serialize objects such as InputFormat and OutputFormat and pass those to the backend.
It will make sense for Pig to follow suit and serialize all UDFs on the front end.  This
will remove the need for the UDFContext black magic that we do at the moment and should allow
all UDFs to easily transfer information from front end to backend.
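
As a rough sketch of the serialize-on-the-front-end idea, the following round-trips
an object through a byte array using plain Java serialization. This is an assumption
for illustration only: the mechanism Hadoop or Pig would actually use may differ, and
SerializableLoader and FrontEndToBackEnd are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader whose front-end state travels with the object itself.
class SerializableLoader implements Serializable {
    private static final long serialVersionUID = 1L;
    private final List<String> cacheFiles = new ArrayList<>();

    void addCacheFile(String path) {
        cacheFiles.add(path);
    }

    List<String> getCacheFiles() {
        return cacheFiles;
    }
}

// Hypothetical helper standing in for the front-end to backend hand-off.
class FrontEndToBackEnd {
    // Front end: turn the configured object into bytes.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Backend: rebuild the object, including the state set on the front end.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

class FrontEndToBackEndDemo {
    public static void main(String[] args) {
        SerializableLoader front = new SerializableLoader();
        front.addCacheFile("/tables/a/index");
        byte[] wire = FrontEndToBackEnd.serialize(front);
        SerializableLoader back = (SerializableLoader) FrontEndToBackEnd.deserialize(wire);
        System.out.println(back.getCacheFiles());
    }
}
```

Because the state lives in the object's own fields, each instance carries its own
settings and no shared config namespace is involved.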

So, hopefully this can get resolved when Pig migrates to Hadoop 0.22, whenever that is.

> Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
> --------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1337
>                 URL: https://issues.apache.org/jira/browse/PIG-1337
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Chao Wang
>
> The Zebra storage layer needs to use the distributed cache to reduce name node load during job runs.
> To do this, Zebra needs to set up distributed-cache-related configuration information in TableLoader (which extends Pig's LoadFunc).
> It does this within getSchema(conf). The problem is that the conf object here is not the one that is serialized to the map/reduce backend. As such, the distributed cache is not set up properly.
> To work around this problem, we need Pig's LoadFunc to provide a way to set up distributed cache information in a conf object, and this conf object must be the one used by the map/reduce backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

