hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader
Date Fri, 13 Nov 2009 01:26:39 GMT

    [ https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777316#action_12777316
] 

Richard Ding commented on PIG-879:
----------------------------------

As a related issue, Pig currently has a feature that allows a user to specify a local
file in a load statement even when Pig is running in MapReduce mode. Namely, 

{code}
A = load 'file:test/org/apache/pig/test/data/passwd' using PigStorage(':');
{code}

is a valid Pig statement. Internally, Pig moves the file from the local file system to
HDFS and gives the corresponding HDFS URI to the loader: 

{code}
hdfs://localhost.localdomain:37575/tmp/temp957104276/tmp124591329
{code}

As we move to the new Load/Store API, there are two options:

1. Stop supporting this feature. The above load statement can be replaced by the following
statements:

{code}
copyFromLocal ./test/org/apache/pig/test/data/passwd ./passw
A = load './passw' using PigStorage(':');
{code}

2. Keep supporting it: the default implementation of the _relativeToAbsolutePath(String location, Path curDir)_
method will move the local file (specified by location) to HDFS and return the corresponding
HDFS URI.
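
To make option 2 concrete, here is a minimal sketch of what such a default could look like. It is not Pig code: the class name StagingResolver is hypothetical, java.nio.file.Path stands in for Hadoop's Path, a local staging directory stands in for the HDFS temp directory, and the file is copied rather than moved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, NOT part of Pig. It only illustrates the behaviour described in option 2.
class StagingResolver {
    // Assumption: a local directory plays the role of the HDFS temp directory.
    private final Path stagingDir;

    StagingResolver(Path stagingDir) {
        this.stagingDir = stagingDir;
    }

    String relativeToAbsolutePath(String location, Path curDir) throws IOException {
        if (location.startsWith("file:")) {
            // Local input: copy it into the staging area and hand back the staged URI.
            Path local = Paths.get(location.substring("file:".length()));
            Files.createDirectories(stagingDir);
            Path staged = Files.createTempFile(stagingDir, "tmp", "");
            Files.copy(local, staged, StandardCopyOption.REPLACE_EXISTING);
            return staged.toUri().toString();
        }
        if (location.contains("://")) {
            // Already fully qualified (for example an hdfs:// URI): leave it alone.
            return location;
        }
        // Plain relative or absolute path: resolve it against the current directory.
        return curDir.resolve(location).normalize().toString();
    }
}

class StagingResolverDemo {
    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("pig-sketch");
        Path local = work.resolve("passwd");
        Files.write(local, "root:x:0:0".getBytes(StandardCharsets.UTF_8));

        StagingResolver resolver = new StagingResolver(work.resolve("staging"));
        // The loader would receive the staged URI instead of the file: location.
        System.out.println(resolver.relativeToAbsolutePath("file:" + local, Paths.get("/user/bla")));
        System.out.println(resolver.relativeToAbsolutePath("data/in", Paths.get("/user/bla")));
    }
}
```

With a default like this, a load statement using a file: location keeps working, at the cost of hiding a copy step inside path resolution.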

Any comments on these options? In particular, does anyone have a strong objection to choosing
option 1? 


> Pig should provide a way for input location string in load statement to be passed as-is
to the Loader
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-879
>                 URL: https://issues.apache.org/jira/browse/PIG-879
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Richard Ding
>
>  Due to multiquery optimization, Pig always converts the filenames to absolute URIs (see
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section about Incompatible
Changes - Path Names and Schemes). This is necessary since the script may have "cd .." statements
between load or store statements and if the load statements have relative paths, we would
need to convert to absolute paths to know where to load/store from. To do this, QueryParser.massageFilename()
has the code below[1], which basically gives the fully qualified HDFS path.
>  
> However the issue with this approach is that if the filename string is something like
"hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2",
the code below[1] actually translates this to hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
and throws an exception that it is an incorrect path.
>  
> Some loaders may want to interpret the filenames (the input location string in the load
statement) in any way they wish and may want Pig to not make absolute paths out of them.
>  
> There are a few options to address this:
> 1)    A command line switch to indicate to Pig that pathnames in the script are all absolute
and hence Pig should not alter them and should pass them as-is to Loaders and Storers. 
> 2)    A keyword in the load and store statements to indicate the same intent to Pig.
> 3)    A property which users can supply on the command line or in pig.properties to indicate the
same intent.
> 4)    A method in LoadFunc - relativeToAbsolutePath(String filename, String curDir) which
does the conversion to absolute - this way a Loader can choose to implement it as a no-op.
> Thoughts?
>  
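
Option 4 in the description above can be sketched as follows. This is only an illustration: LocationAwareLoader, DefaultResolvingLoader and PassThroughLoader are hypothetical names rather than Pig classes, and the default resolution logic is a simplified assumption, not what QueryParser.massageFilename() actually does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for LoadFunc; only the method proposed in option 4 is modelled.
interface LocationAwareLoader {
    // Assumed default: qualify relative paths against curDir, leave everything else alone.
    default String relativeToAbsolutePath(String filename, String curDir) {
        if (filename.contains("://") || filename.startsWith("/")) {
            return filename;
        }
        return curDir.endsWith("/") ? curDir + filename : curDir + "/" + filename;
    }
}

// A loader that is happy with the default conversion.
class DefaultResolvingLoader implements LocationAwareLoader {
}

// A loader that wants the location string exactly as written in the script.
class PassThroughLoader implements LocationAwareLoader {
    @Override
    public String relativeToAbsolutePath(String filename, String curDir) {
        return filename; // no-op: the loader interprets the string itself
    }

    // Example of loader-specific interpretation: a comma-separated list of inputs.
    List<String> inputs(String location) {
        return Arrays.asList(location.split(","));
    }
}

class LoaderLocationDemo {
    public static void main(String[] args) {
        String multi = "hdfs://localhost.localdomain:39125/user/bla/1,"
                + "hdfs://localhost.localdomain:39125/user/bla/2";
        PassThroughLoader loader = new PassThroughLoader();
        // The string reaches the loader unchanged and is split by the loader, not by Pig.
        String location = loader.relativeToAbsolutePath(multi, "/user/bla");
        System.out.println(loader.inputs(location));
        System.out.println(new DefaultResolvingLoader().relativeToAbsolutePath("data/1", "/user/bla"));
    }
}
```

The point of the design is that the decision about what a location string means moves from the parser to the loader, so a loader with unusual location syntax does not need a global switch.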

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

