pig-dev mailing list archives

From "Xuzhou Yin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-5360) Pig sets working directory of input file systems causes exception thrown
Date Thu, 27 Sep 2018 05:02:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuzhou Yin updated PIG-5360:
----------------------------
    Description: 
The getSplits() method in PigInputFormat sets the working directory of each input file system
to jobContext.getWorkingDirectory(). That value is always the default working directory of the
default file system (e.g. hdfs://host:port/user/userId in the case of HDFS) unless
"mapreduce.job.working.dir" is explicitly set to a non-default value. So if an input path uses
a non-default file system, the call fails, because it tries to set the working directory of
that non-default file system to an HDFS path.
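
The failure mode can be illustrated with a minimal sketch. It is a simplified model and not
Hadoop's or Pig's actual code: the class WorkingDirSketch, the "otherfs://store" file system,
the host and port, and the exception message are all assumptions made for illustration. The
model only assumes that a file system rejects a working directory whose scheme or authority
differs from its own.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical stand-in for a file system object. It only models the
// "does this path belong to me?" check, nothing else.
final class WorkingDirSketch {
    private final URI fsUri;   // root URI of this (modeled) file system
    private URI workingDir;    // last accepted working directory, or null

    WorkingDirSketch(URI fsUri) {
        this.fsUri = fsUri;
    }

    // Accept the directory only if it is unqualified or names this file system.
    void setWorkingDirectory(URI dir) {
        boolean sameFs = dir.getScheme() == null
                || (dir.getScheme().equalsIgnoreCase(fsUri.getScheme())
                    && Objects.equals(dir.getAuthority(), fsUri.getAuthority()));
        if (!sameFs) {
            // Message format is an assumption for this sketch.
            throw new IllegalArgumentException("Wrong FS: " + dir + ", expected: " + fsUri);
        }
        this.workingDir = dir;
    }

    URI getWorkingDirectory() {
        return workingDir;
    }

    public static void main(String[] args) {
        // Job working directory on the default file system (assumed values).
        URI jobWorkingDir = URI.create("hdfs://namenode:8020/user/userId");

        // Input on the default file system: the call succeeds.
        WorkingDirSketch defaultFs = new WorkingDirSketch(URI.create("hdfs://namenode:8020"));
        defaultFs.setWorkingDirectory(jobWorkingDir);
        System.out.println("default fs accepted " + defaultFs.getWorkingDirectory());

        // Input on a non-default file system: the same call is rejected.
        WorkingDirSketch otherFs = new WorkingDirSketch(URI.create("otherfs://store"));
        try {
            otherFs.setWorkingDirectory(jobWorkingDir);
        } catch (IllegalArgumentException e) {
            System.out.println("non-default fs rejected it: " + e.getMessage());
        }
    }
}
```

In this model the rejection happens only because the job working directory is qualified with
the default file system's scheme and authority, which matches the situation described above.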

The proposed change is to remove this working-directory logic entirely. There are several
reasons for doing so.

Firstly, getSplits() is only supposed to return a list of input splits. It should not have
side effects, especially ones that can change the output path. Having an InputFormat change
the behavior of an OutputFormat does not make much sense here.

Secondly, the working directories of the input and output file systems are inconsistent.
If "mapreduce.job.working.dir" is set to a non-default value, it affects only the output path
(and only if that path is relative), because the input path has already been fully qualified
before this logic runs.

Thirdly, there is already a "CD" functionality that allows customers to change the working
directory. However, this logic overrides the "CD" functionality when the input and output
paths both use the default file system.

Lastly, if a customer has a sequence of jobs, changing the working directory may change the
input paths of downstream jobs when those input paths are specified as relative paths.
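
The effect of the working directory on relative paths can be sketched as follows. This is an
illustration built on java.net.URI only, not on Hadoop's Path or FileSystem classes; the class
RelativePathSketch, the helper qualify(), and the host, port, and directory names are
hypothetical. It shows that a relative path resolves to a different location once the working
directory changes, while a fully qualified path is unaffected.

```java
import java.net.URI;

// Illustrative only: models path qualification with java.net.URI.
final class RelativePathSketch {
    // Resolve a path against a working directory. The trailing slash makes
    // URI.resolve treat the working directory as a directory, not a file.
    static URI qualify(String workingDir, String path) {
        String base = workingDir.endsWith("/") ? workingDir : workingDir + "/";
        return URI.create(base).resolve(path);
    }

    public static void main(String[] args) {
        String defaultDir = "hdfs://host:8020/user/userId";  // assumed default
        String changedDir = "hdfs://host:8020/tmp/jobdir";   // assumed override

        // A relative path follows the working directory.
        System.out.println(qualify(defaultDir, "output/result"));
        System.out.println(qualify(changedDir, "output/result"));

        // A fully qualified path is the same under either working directory.
        System.out.println(qualify(defaultDir, "hdfs://host:8020/data/input"));
        System.out.println(qualify(changedDir, "hdfs://host:8020/data/input"));
    }
}
```

Under these assumptions, a downstream job that names its input as "output/result" would read
from a different location depending on which working directory is in effect at the time.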

  was:
In getSplits() method in PigInputFormat, Pig is trying to set the working directory
of input File System to jobContext.getWorkingDirectory(), which is always the default working
directory of default file system (eg. hdfs://host:port/user/userId in case of HDFS) unless
"mapreduce.job.working.dir" is explicitly set to non-default value. So if the input path
uses non-default file system, then it will fail since it is trying to set the working directory
of non-default file system to a HDFS path.

The proposed change is to completely remove this logic of setting working directory.
There are several reasons for doing so.

Firstly, getSplits() is only supposed to return a list of input splits. It
should not have side effects (especially doing so can potentially change the output path).

Secondly, there is inconsistency between the working directories of input and
output file systems. if "mapreduce.job.working.dir" is set to non-default value, it will affect
the output path only (if it is a relative path) because input path will be made qualified
even before this logic.

Thirdly, there is already a "CD" functionality that allows customers to change
the working directory. However, this logic will overwrite the "CD" functionality if input
and output paths both use default file system.


> Pig sets working directory of input file systems causes exception thrown
> ------------------------------------------------------------------------
>
>                 Key: PIG-5360
>                 URL: https://issues.apache.org/jira/browse/PIG-5360
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.17.0
>            Reporter: Xuzhou Yin
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.18.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
