hadoop-pig-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
Date Sat, 05 Jun 2010 05:04:28 GMT

    [ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875858#action_12875858 ]

Scott Carey commented on PIG-1337:

Why not just give a loader (or storer) the ability to set things on a conf object directly?
DistributedCache won't be the only thing that I'll want access to, and I don't think Pig will
want to add a new function every time a Hadoop feature comes along that someone wants to use.

Right now, users can set anything they want with properties on the script command line, but
have zero ability to set them in compiled code!  This seems backwards to me.  A custom LoadFunc
or StoreFunc should either have access to the configuration that gets serialized for
the job, or be able to return a Configuration object with the settings it wants Pig
to pass on (Pig can then ignore or overwrite things that a user should never touch, similar
to what happens with command line params).

Perhaps either a:

void configure(Configuration config);

method or

Configuration getCustomConfiguration();

method would be great.  The method names for the loader and the storer may have to differ so
as not to collide in classes that implement both, and the two should not share one method,
since telling the two uses apart would be a problem (a loader and a storer may not both want
distributed cache, for example).
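
A minimal sketch of how such hooks could look, under stated assumptions: a plain
Map<String, String> stands in for Hadoop's Configuration so the example runs without
Hadoop on the classpath, and every name here (LoadConfigurer, StoreConfigurer,
configureLoad, configureStore, PROTECTED_KEYS, ExampleTableFunc) is hypothetical and not
part of Pig's API.  The property keys and values are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class LoaderConfigSketch {

    // Hypothetical hooks. The method names differ so that a class implementing
    // both sides can configure loading and storing independently.
    interface LoadConfigurer {
        void configureLoad(Map<String, String> conf);
    }

    interface StoreConfigurer {
        void configureStore(Map<String, String> conf);
    }

    // Keys the framework refuses to accept from user code (illustrative choice).
    static final Set<String> PROTECTED_KEYS =
            new HashSet<>(Arrays.asList("mapred.job.tracker"));

    // Copies requested settings into the job configuration, skipping protected keys.
    static void merge(Map<String, String> jobConf, Map<String, String> requested) {
        for (Map.Entry<String, String> e : requested.entrySet()) {
            if (!PROTECTED_KEYS.contains(e.getKey())) {
                jobConf.put(e.getKey(), e.getValue());
            }
        }
    }

    // The framework collects the loader's wishes, then filters them into the job conf.
    static void applyLoadSettings(LoadConfigurer loader, Map<String, String> jobConf) {
        Map<String, String> requested = new LinkedHashMap<>();
        loader.configureLoad(requested);
        merge(jobConf, requested);
    }

    static void applyStoreSettings(StoreConfigurer storer, Map<String, String> jobConf) {
        Map<String, String> requested = new LinkedHashMap<>();
        storer.configureStore(requested);
        merge(jobConf, requested);
    }

    // A class acting as both loader and storer; only the load side asks for a cache file.
    static class ExampleTableFunc implements LoadConfigurer, StoreConfigurer {
        @Override
        public void configureLoad(Map<String, String> conf) {
            conf.put("mapred.cache.files", "hdfs:///tmp/example-index#index");
            conf.put("mapred.job.tracker", "elsewhere:9001"); // protected, so ignored
        }

        @Override
        public void configureStore(Map<String, String> conf) {
            conf.put("example.store.codec", "gz");
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new LinkedHashMap<>();
        jobConf.put("mapred.job.tracker", "cluster:9001");

        ExampleTableFunc func = new ExampleTableFunc();
        applyLoadSettings(func, jobConf);
        applyStoreSettings(func, jobConf);

        System.out.println(jobConf);
    }
}
```

Collecting the requested settings in a scratch map and merging them afterwards corresponds
to the getCustomConfiguration() variant; passing the job configuration straight to the hook
would correspond to configure(Configuration), at the cost of leaving the filtering to a
later overwrite step.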

> Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
> --------------------------------------------------------------------------------------------------
>                 Key: PIG-1337
>                 URL: https://issues.apache.org/jira/browse/PIG-1337
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Chao Wang
>             Fix For: 0.8.0
> The Zebra storage layer needs to use distributed cache to reduce name node load during
> job runs.
> To do this, Zebra needs to set up distributed cache related configuration information
> in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object here is
> not the one that is being serialized to the map/reduce backend. As such, the distributed cache
> is not set up properly.
> To work around this problem, we need Pig's LoadFunc to provide a way for us
> to set up distributed cache information in a conf object, and this conf object must be the one
> used by the map/reduce backend.

