hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-102) Dont copy to DFS if source filesystem marked as shared
Date Tue, 11 Mar 2008 22:28:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577633#action_12577633 ]

Pi Song commented on PIG-102:
-----------------------------

Ben,

I think about it this way (you may disagree):

- The basic concept is that you have a source file system X, an execution engine Y, and a destination
file system. HDFS + MapReduce, where the source files are already in HDFS, fits this model perfectly.
Craig's NFS + MapReduce "no copy across" setup also fits.
- Now suppose the input files live on a file system X, but the execution engine Y only executes
against its own file system Z. Then it is the responsibility of the execution engine to pull the files
from the source file system into its temporary storage Z in order to execute. After
it's done, the output should be copied back to the real file system (leaving the output on the
temporary storage Z doesn't sound good). I'm just trying to define a good semantic in the
first place.
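The semantics above could be sketched roughly as follows. This is only an illustration of the decision, not Pig's actual API; the class and method names (LocalizationModel, plan, Placement) are hypothetical.

```java
// Illustrative sketch of the semantics described above: if the source is
// already visible to the execution engine (same file system, or one marked
// shared), the engine can use the input in place; otherwise it must pull the
// input to its temporary storage and copy the output back afterwards.
// All names here are hypothetical, not Pig's actual API.
public class LocalizationModel {

    public enum Placement { USE_IN_PLACE, COPY_IN_AND_OUT }

    // Decide whether the engine can read the source directly.
    public static Placement plan(String sourceFs, String engineFs, boolean sourceIsShared) {
        if (sourceFs.equals(engineFs) || sourceIsShared) {
            // e.g. HDFS input + MapReduce, or Craig's shared-NFS setup
            return Placement.USE_IN_PLACE;
        }
        // pull inputs into temporary storage Z, copy results back when done
        return Placement.COPY_IN_AND_OUT;
    }
}
```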

> Dont copy to DFS if source filesystem marked as shared
> ------------------------------------------------------
>
>                 Key: PIG-102
>                 URL: https://issues.apache.org/jira/browse/PIG-102
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>         Environment: Installations with shared folders on all nodes (eg NFS)
>            Reporter: Craig Macdonald
>         Attachments: shared.patch
>
>
> I've been playing with Pig using three setups:
> (a) local
> (b) hadoop mapred with hdfs
> (c) hadoop mapred with file:///path/to/shared/fs as the default file system
> In our local setup, various NFS filesystems are shared between all machines (including
> mapred nodes), e.g. /users, /local.
> I would like Pig to note when input files are in a file:// directory that has been marked
> as shared, and hence not copy them to DFS.
> Similarly, the Torque PBS resource manager has a usecp directive, which notes when a
> filesystem location is shared between all nodes (and hence scp is not needed; cp alone can
> be used). See http://www.clusterresources.com/wiki/doku.php?id=torque:6.2_nfs_and_other_networked_filesystems
> It would be good to have a configurable setting in Pig that says when a filesystem is
> shared, and hence no copying between file:// and hdfs:// is needed.
> An example in our setup, if commands of this form were to be used, might be:
> sharedFS file:///local/
> sharedFS file:///users/
> This setting should be used with care. Obviously if you have 1000 nodes all accessing
> a shared file over NFS, then it would have been better to "hadoopify" the file.
> The likely area of code to patch is src/org/apache/pig/impl/io/FileLocalizer.java hadoopify(String,
> PigContext)
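A guard of the kind Craig describes could look roughly like the sketch below: before copying, check whether the input URI falls under a prefix declared shared. The class and method names (SharedFsConfig, isShared) are illustrative only, not part of Pig or the attached patch.

```java
import java.util.List;

// Hypothetical sketch: decide whether a file:// input lies under a prefix
// that has been configured as shared (visible on all nodes), in which case
// the copy to DFS could be skipped. Names here are illustrative, not Pig APIs.
public class SharedFsConfig {
    private final List<String> sharedPrefixes;

    public SharedFsConfig(List<String> sharedPrefixes) {
        this.sharedPrefixes = sharedPrefixes;
    }

    // True if the URI starts with any prefix declared shared,
    // e.g. "file:///local/" or "file:///users/".
    public boolean isShared(String fileUri) {
        for (String prefix : sharedPrefixes) {
            if (fileUri.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A hadoopify-style routine would then copy only when isShared returns false.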

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

