pig-dev mailing list archives

From "Aniket Mokashi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
Date Thu, 26 Sep 2013 18:51:03 GMT

    [ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779085#comment-13779085 ]

Aniket Mokashi commented on PIG-2672:
-------------------------------------

[~rohini], the current code has:
{code} 
Path dst = new Path(FileLocalizer.getTemporaryPath(pigContext).toUri().getPath(), suffix);

{code}
Hence, files are copied to /tmp/temp-<random>/ by default. I do not see a way to configure
this to be a relative path, but I might be wrong.
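
To make the path composition above concrete, here is a minimal sketch in plain Java. It does not use the Hadoop or Pig classes: temporaryDir is a hypothetical stand-in for FileLocalizer.getTemporaryPath(pigContext), and the "/tmp" root and "temp-<random>" naming are assumptions taken only from the default described above.

```java
import java.util.Random;

final class TempDstSketch {
    // Hypothetical stand-in for FileLocalizer.getTemporaryPath(pigContext):
    // returns an absolute directory of the form <tempRoot>/temp-<random>.
    static String temporaryDir(String tempRoot, Random rng) {
        return tempRoot + "/temp-" + rng.nextInt(Integer.MAX_VALUE);
    }

    // Mirrors new Path(parent, suffix): the suffix is appended under the parent,
    // so the result is absolute whenever the parent is absolute.
    static String destination(String parentDir, String suffix) {
        return parentDir + "/" + suffix;
    }

    public static void main(String[] args) {
        String dst = destination(temporaryDir("/tmp", new Random()), "udfs.jar");
        System.out.println(dst);
    }
}
```

Because the parent directory is always absolute in this sketch, the destination cannot end up relative, which matches the behaviour described above.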

bq. UserEvil can figure out what the shared hdfs path is since he has access to the local
file.
This is already true today: UserEvil can look in the jobconf to find the location of the jars
and replace any of them. Even when the jars are protected as Rohini explained earlier,
that protection comes from HDFS, not from Pig.
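
To illustrate the quoted concern, here is a minimal sketch of why anyone holding the local file could work out a shared location. It assumes the shared path is derived from a checksum of the jar contents; that layout, the SHA-1 choice, the "/shared/cache" root and the sharedPath helper are all hypothetical, for illustration only, and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SharedPathSketch {
    // Hypothetical layout: <cacheRoot>/<sha1-of-contents>/<jarName>.
    static String sharedPath(String cacheRoot, byte[] jarBytes, String jarName) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(jarBytes);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return cacheRoot + "/" + hex + "/" + jarName;
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = "example jar contents".getBytes(StandardCharsets.UTF_8);
        // Two users with identical bytes compute the identical path.
        String mine = sharedPath("/shared/cache", jar, "udfs.jar");
        String theirs = sharedPath("/shared/cache", jar.clone(), "udfs.jar");
        System.out.println(mine.equals(theirs));
    }
}
```

Under this assumption the path carries no secret: knowing it does not grant write access, so whether a jar can be replaced still depends on the HDFS permissions of the target directory.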

I'm deliberately avoiding permission checks in this code path. In terms of security, I
feel that this is no worse than what we have right now.

Next steps:
1. Address code review comments from RB and submit a fresh patch.
2. Run this for several jobs in practice and confirm there are no adverse side effects.
3. [~cheolsoo], can you please help me with e2e tests for this?
4. Open a documentation jira and explain how this works in the Pig docs.

Anything else I missed?
                
> Optimize the use of DistributedCache
> ------------------------------------
>
>                 Key: PIG-2672
>                 URL: https://issues.apache.org/jira/browse/PIG-2672
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Aniket Mokashi
>             Fix For: 0.12.0
>
>         Attachments: PIG-2672.patch
>
>
> Pig currently copies jar files to a temporary location in hdfs and then adds them to DistributedCache for each job launched. This is inefficient in terms of 
>    * Space - The jars are distributed to task trackers for every job taking up lot of local temporary space in tasktrackers.
>    * Performance - The jar distribution impacts the job launch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
