Date: Mon, 7 Jan 2013 22:02:15 +0000 (UTC)
From: "Jeremy Karn (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Subject: [jira] [Updated] (HADOOP-9184) Some reducers failing to write final output file to s3.


     [ https://issues.apache.org/jira/browse/HADOOP-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Karn updated HADOOP-9184:
--------------------------------

    Status: Open  (was: Patch Available)

> Some reducers failing to write final output file to s3.
> -------------------------------------------------------
>
>                 Key: HADOOP-9184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9184
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Jeremy Karn
>         Attachments: example.pig, task_log.txt
>
>
> We had a Hadoop job running 100 reducers, most of which were expected to write out an empty file. When the final output went to an S3 bucket, we sometimes found a final part file missing. This happened in roughly 1 job in 3, so roughly 1 reducer in 300 failed to write its output properly. I've attached the pig script we were using to reproduce the bug.
> After an in-depth look and instrumenting the code, we traced the problem to moveTaskOutputs in FileOutputCommitter.
> The code there looked like:
> {code}
> if (fs.isFile(taskOutput)) {
> 	… do stuff …
> } else if(fs.getFileStatus(taskOutput).isDir()) {
> 	… do stuff …
> }
> {code}
> What we saw is that for the problem jobs neither branch was being exercised. I've attached the task log of our instrumented code. In this version we added an else statement that prints the line "THIS SEEMS LIKE WE SHOULD NEVER GET HERE …".
> The root cause seems to be an eventual consistency issue with S3. You can see in the log that the first time moveTaskOutputs is called, it finds that taskOutput is a directory. It goes into the isDir() branch and successfully retrieves the list of files in that directory from S3 (in this case just one file). This triggers a recursive call to moveTaskOutputs for the file found in the directory. In that second pass through moveTaskOutputs, however, the temporary output file can't be found, so both branches of the above if statement are skipped and the temporary file is never moved to the final output location.
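
The failure mode described above can be illustrated with a small, self-contained Java sketch. This is not the FileOutputCommitter code and it does not use any Hadoop class: SimpleFs, LaggyFs, silentMove, checkedMove and the paths "tmp" and "out" are all hypothetical names made up for the illustration. LaggyFs imitates a store where a file shows up in a directory listing but is not yet visible to a direct lookup. silentMove mirrors the two-branch shape from the report; checkedMove is one possible way to surface the problem (bounded retries, then an error), offered only as an assumption about what a fix might look like.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these types are Hadoop classes.
class MoveTaskOutputsSketch {

    /** Minimal stand-in for the file system calls used in the report. */
    interface SimpleFs {
        boolean isFile(String path);
        boolean isDir(String path);
        List<String> list(String dir);
        void rename(String from, String to);
    }

    /** One directory "tmp" with one file that is listed before lookups can see it. */
    static class LaggyFs implements SimpleFs {
        private int staleLookupsLeft;
        final List<String> committed = new ArrayList<>();

        LaggyFs(int staleLookups) { this.staleLookupsLeft = staleLookups; }

        public boolean isFile(String path) {
            if (!path.equals("tmp/part-00000")) return false;
            if (staleLookupsLeft > 0) { staleLookupsLeft--; return false; }
            return true;
        }
        public boolean isDir(String path) { return path.equals("tmp"); }
        public List<String> list(String dir) {
            return dir.equals("tmp")
                ? Collections.singletonList("tmp/part-00000")
                : Collections.<String>emptyList();
        }
        public void rename(String from, String to) { committed.add(to); }
    }

    /** Same shape as the reported code: a path that is neither file nor dir is skipped. */
    static void silentMove(SimpleFs fs, String path, String destDir) {
        if (fs.isFile(path)) {
            fs.rename(path, destDir + "/" + baseName(path));
        } else if (fs.isDir(path)) {
            for (String child : fs.list(path)) silentMove(fs, child, destDir);
        }
        // No else branch: the missing file is dropped without any error.
    }

    /** Assumed alternative: retry a bounded number of times, then fail loudly. */
    static void checkedMove(SimpleFs fs, String path, String destDir, int maxAttempts)
            throws IOException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (fs.isFile(path)) {
                fs.rename(path, destDir + "/" + baseName(path));
                return;
            }
            if (fs.isDir(path)) {
                for (String child : fs.list(path)) checkedMove(fs, child, destDir, maxAttempts);
                return;
            }
            // A real implementation would wait between attempts; omitted here.
        }
        throw new IOException("Task output not visible after " + maxAttempts
            + " attempts: " + path);
    }

    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws IOException {
        LaggyFs a = new LaggyFs(2);
        silentMove(a, "tmp", "out");
        System.out.println("silent:  " + a.committed);   // prints "silent:  []"

        LaggyFs b = new LaggyFs(2);
        checkedMove(b, "tmp", "out", 3);
        System.out.println("checked: " + b.committed);   // prints "checked: [out/part-00000]"
    }
}
```

With two stale lookups, the two-branch version commits nothing and reports nothing, which matches the missing part file in the report. The retrying version commits the file on its third attempt, and throws if the file never becomes visible, so the task fails visibly instead of losing output.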