hadoop-mapreduce-issues mailing list archives

From "Stan Rosenberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5247) FileInputFormat should filter files with '._COPYING_' sufix
Date Fri, 19 Jul 2013 17:00:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713843#comment-13713843 ]

Stan Rosenberg commented on MAPREDUCE-5247:

Robert, what is the intended operational meaning of 'hadoop fs -put local dst'?  Is it not that
the file denoted by local is "atomically" transferred into dst?  If that's the case, then I'd
argue that it's broken: from the MR perspective the transfer is not truly atomic, since files
with the suffix ._COPYING_ are _visible_.

As I've indicated above, we have jobs which execute as soon as new data is available for that
(hdfs) partition.  The external scheduler knows when new data has finished loading, namely
when all pending hdfs 'put' operations complete. 
 (Think of it as a special type of job in the sense that it runs many times per hour, every
time processing a superset of the input files.)  

Your claim that MR was not designed to run on data that is changing underneath it seems rather
putative.  What is wrong with the above approach, assuming that the intended semantics of 'put'
is atomic transfer without the (MR) observable side effect of ._COPYING_?  (In other words, if MR
is oblivious to ._COPYING_, then data is not changing underneath it.)

> FileInputFormat should filter files with '._COPYING_' sufix
> -----------------------------------------------------------
>                 Key: MAPREDUCE-5247
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5247
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Stan Rosenberg
> FsShell copy/put creates staging files with '._COPYING_' suffix.  These files should
> be considered hidden by FileInputFormat.  A simple fix is to add the following conjunct to
> the existing hiddenFilter:
> {code}
> !name.endsWith("._COPYING_")
> {code}
> After upgrading to CDH 4.2.0 we encountered this bug. We have a legacy data loader which
> uses 'hadoop fs -put' to load data into hourly partitions.  We also have intra-hourly jobs
> which are scheduled to execute several times per hour using the same hourly partition as input.
> Thus, as new data is continuously loaded, these staging files (i.e., ._COPYING_) are
> breaking our jobs (since the staging files are moved once copy/put completes).
> As a workaround, we've defined a custom input path filter and loaded it with "mapred.input.pathFilter.class".
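
For illustration only, here is a minimal sketch of the filtering rule discussed above. It is not the reporter's actual filter and not Hadoop's code: it works on plain file names so that it runs without Hadoop on the classpath, and the class and member names (StagingFileFilterSketch, VISIBLE_INPUT, visibleInputs) are hypothetical. It assumes the existing hidden-file rule skips names that start with '_' or '.', and adds the conjunct proposed in the issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Standalone sketch: operates on plain file names, no Hadoop dependency.
// All names in this class are hypothetical, chosen for illustration.
final class StagingFileFilterSketch {

    // Suffix that FsShell copy/put gives to in-flight files, per the issue description.
    static final String STAGING_SUFFIX = "._COPYING_";

    // Assumed existing rule (skip names starting with '_' or '.'),
    // extended with the conjunct proposed in the issue.
    static final Predicate<String> VISIBLE_INPUT = name ->
            !name.startsWith("_")
            && !name.startsWith(".")
            && !name.endsWith(STAGING_SUFFIX);

    // Returns only the names a job should treat as input, in listing order.
    static List<String> visibleInputs(List<String> names) {
        return names.stream().filter(VISIBLE_INPUT).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "part-00000", "_SUCCESS", ".hidden", "events.log._COPYING_", "events.log");
        // The staging file and the two hidden files are dropped.
        System.out.println(visibleInputs(listing)); // prints [part-00000, events.log]
    }
}
```

In a real job the same test would presumably live in a class implementing org.apache.hadoop.fs.PathFilter, applied to the path's final name component, and be registered through the "mapred.input.pathFilter.class" property mentioned in the workaround.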
