pig-dev mailing list archives

From "Chao Wang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
Date Thu, 01 Apr 2010 16:40:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852445#action_12852445 ]

Chao Wang commented on PIG-1337:

It's ok for us not to use getSchema() for this purpose since it's a pure getter method.

What we need is simply a setter method in LoadFunc through which we can set up the distributed
cache. Pig needs to ensure that this information actually ends up in the job configuration object
that is passed to the Hadoop backend.
Also, this setter method should be invoked only on Pig's frontend. In the case of one m/r
job containing multiple LoadFunc instances, Pig may need to combine the distributed cache configuration
information from all instances.
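
To illustrate the combining step, here is a minimal sketch in plain Java. It is not Pig's or
Zebra's actual API: the CacheAwareLoader interface, its addCacheFiles() hook, the class name and
the HDFS paths are all hypothetical, and a plain Map stands in for the Hadoop job configuration.
The only real name is the classic Hadoop property "mapred.cache.files", which holds a
comma-separated list of cache file URIs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CacheConfSketch {
    // Classic Hadoop property: comma-separated URIs of distributed cache files.
    public static final String CACHE_FILES_KEY = "mapred.cache.files";

    /** Hypothetical stand-in for a LoadFunc that needs files in the distributed cache. */
    public interface CacheAwareLoader {
        // Hypothetical setter-style hook, meant to be called on the frontend only.
        void addCacheFiles(Set<String> uris);
    }

    /**
     * Collects the cache files requested by every loader in one m/r job and writes
     * the merged, de-duplicated list into the job-level configuration (a Map here,
     * standing in for the object that would be serialized to the backend).
     */
    public static void mergeIntoJobConf(List<CacheAwareLoader> loaders,
                                        Map<String, String> jobConf) {
        Set<String> merged = new LinkedHashSet<>();
        String existing = jobConf.get(CACHE_FILES_KEY);
        if (existing != null && !existing.isEmpty()) {
            // Keep whatever was already configured for the job.
            merged.addAll(Arrays.asList(existing.split(",")));
        }
        for (CacheAwareLoader loader : loaders) {
            loader.addCacheFiles(merged);
        }
        if (!merged.isEmpty()) {
            jobConf.put(CACHE_FILES_KEY, String.join(",", merged));
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new LinkedHashMap<>();
        // Two loaders in the same job; the paths are made up for illustration.
        List<CacheAwareLoader> loaders = Arrays.asList(
            uris -> uris.add("hdfs://namenode/zebra/t1/meta"),
            uris -> {
                uris.add("hdfs://namenode/zebra/t1/meta"); // duplicate, merged away
                uris.add("hdfs://namenode/zebra/t2/meta");
            });
        mergeIntoJobConf(loaders, jobConf);
        System.out.println(jobConf.get(CACHE_FILES_KEY));
    }
}
```

The point of the sketch is only that the merge happens once, on the frontend, against the same
configuration object that is later shipped to the backend.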

Also, we note that using the UDFContext to convey information from the frontend to the backend
does not work for this. We need the job configuration object to already contain all the distributed
cache related information when it is passed to the Hadoop backend.

> Need a way to pass distributed cache configuration information to hadoop backend in Pig's
> --------------------------------------------------------------------------------------------------
>                 Key: PIG-1337
>                 URL: https://issues.apache.org/jira/browse/PIG-1337
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Chao Wang
>             Fix For: 0.8.0
> The Zebra storage layer needs to use the distributed cache to reduce name node load during
> job runs.
> To do this, Zebra needs to set up distributed cache related configuration information
> in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object here is
> not the one that is being serialized to the map/reduce backend. As such, the distributed cache
> is not set up properly.
> To work around this problem, we need Pig to provide a way in its LoadFunc for us to set up
> distributed cache information in a conf object, and this conf object must be the one
> used by the map/reduce backend.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
