Date: Fri, 19 Jul 2013 15:02:50 +0000 (UTC)
From: "Robert Joseph Evans (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5247) FileInputFormat should filter files with '._COPYING_' sufix

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713743#comment-13713743 ]

Robert Joseph Evans commented on MAPREDUCE-5247:
------------------------------------------------

Why are you running a Map/Reduce job with input from a directory that has not finished being copied? MR was not designed to run on data that is changing underneath it. When the job is done how do you know which of the input files were actually used to produce the output?

This issue existed prior to 2.0 but was even worse without the ._COPYING_ suffix. In those cases the files were opened in place and data was copied directly into them. Your MR job may have read only part of a file rather than all of it, and a file could have disappeared out from under the job if an error occurred. This is not behavior that I want to make a common part of Map/Reduce. If you want to do this and you know the risks, then you can filter ._COPYING_ files out of your list of input files to the MR job. But I don't want the framework to do it automatically for everyone.

> FileInputFormat should filter files with '._COPYING_' sufix
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5247
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5247
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Stan Rosenberg
>
> FsShell copy/put creates staging files with a '._COPYING_' suffix. These files should be considered hidden by FileInputFormat. A simple fix is to add the following conjunct to the existing hiddenFilter:
> {code}
> !name.endsWith("._COPYING_")
> {code}
> After upgrading to CDH 4.2.0 we encountered this bug. We have a legacy data loader which uses 'hadoop fs -put' to load data into hourly partitions. We also have intra-hourly jobs which are scheduled to execute several times per hour using the same hourly partition as input. Thus, as new data is continuously loaded, these staging files (i.e., ._COPYING_) are breaking our jobs (since the staging files are moved when copy/put completes).
> As a workaround, we've defined a custom input path filter and loaded it with "mapred.input.pathFilter.class".
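
For readers who want to apply the opt-in filtering discussed above in their own job, the predicate can be sketched roughly as follows. This is a minimal, Hadoop-free illustration and not the code from the issue or from FileInputFormat: the class name CopyingAwareFilter and its methods are hypothetical, and the "_" and "." prefix checks are an assumption about what the existing hidden-file filter rejects. Only the endsWith("._COPYING_") conjunct comes from the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: models the file-name predicate only, with no Hadoop
// dependency. In a real job the same check would live inside the accept()
// method of a custom input path filter.
final class CopyingAwareFilter {
    // Suffix that FsShell copy/put gives to staging files, per the report.
    static final String COPYING_SUFFIX = "._COPYING_";

    private CopyingAwareFilter() {}

    // Returns true when a file name should be used as job input.
    static boolean accept(String name) {
        // Assumption: the existing hidden filter rejects "_" and "." prefixes.
        boolean hidden = name.startsWith("_") || name.startsWith(".");
        // The extra conjunct proposed in the issue.
        boolean staging = name.endsWith(COPYING_SUFFIX);
        return !hidden && !staging;
    }

    // Applies accept() to a directory listing and keeps the visible names.
    static List<String> visibleInputs(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (accept(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "part-00000", "_SUCCESS", ".hidden", "events.log._COPYING_");
        System.out.println(visibleInputs(listing));
    }
}
```

In an actual job this logic would be wrapped in a class implementing Hadoop's path filter interface and registered through the "mapred.input.pathFilter.class" setting mentioned in the report. Doing so keeps the behavior a per-job choice, which matches the position taken in the comment, but it does not remove the underlying risk of reading a directory that is still being loaded.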