Subject: Re: Read After Write Consistency in HDFS
From: Ted Yu <yuzhihong@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Thu, 2 Sep 2010 11:45:22 -0700

One possibility, given the asynchronous nature of your loader, is that the
consumer job started before all of the loader's files had been completely
written (propagated).

Can you describe what problem you encountered with the OutputCollector?
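If the expected file count is known up front, one way to guard against this
is to have the consumer wait until that many part files are visible in the
loader's output directory before it starts. A rough, untested sketch (the
expected count and timeout are parameters you would have to supply; none of
this is from your code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoaderOutputGuard {
  // Returns true once dir contains at least `expected` part files,
  // polling every 5 seconds until timeoutMs elapses.
  public static boolean waitForFiles(Configuration conf, Path dir,
                                     int expected, long timeoutMs)
      throws Exception {
    FileSystem fs = dir.getFileSystem(conf);
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      FileStatus[] files = fs.listStatus(dir); // may be null if dir is absent
      int count = 0;
      if (files != null) {
        for (FileStatus f : files) {
          // skip job side files such as _logs
          if (!f.getPath().getName().startsWith("_")) {
            count++;
          }
        }
      }
      if (count >= expected) {
        return true;
      }
      Thread.sleep(5000);
    }
    return false;
  }
}

Polling like this is crude, but it avoids depending on timing between the
two jobs.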
On Thu, Sep 2, 2010 at 10:35 AM, Elton Pinto <eptiger@gmail.com> wrote:
> Hello,
>
> I apologize if this topic has already been brought up, but I was unable to
> find it by searching around.
>
> We recently discovered an issue in one of our jobs where the output of one
> job does not always make it into another job. The first job is a loader
> job that's just a map step for asynchronously downloading external data in
> multiple threads and then writing to HDFS directly (i.e. not using the
> OutputCollector) using FileSystem and FSDataOutputStream. I believe we did
> this because we had issues with writing through the OutputCollector in
> this situation.
>
> The job that consumes this data runs immediately afterward, taking the
> loader job's output directory as its input directory. Very rarely, it
> looks like not all the files are consumed, which we assume means they
> hadn't yet propagated to HDFS. The volume of data being loaded is on the
> order of 10 GB.
>
> The fix we're working on is to append the number of files (i.e. the number
> of mappers) to each file name and then check that the actual number of
> files matches the expected count, but I had a few questions about this
> issue:
>
> 1) Has anyone else seen anything like this? Is read-after-write
> consistency just not guaranteed on HDFS?
> 2) Could it be an issue because we're not using an OutputCollector?
> 3) Does anyone know an easy way to change the file name that the
> OutputCollector uses? MultipleTextOutputFormat seems to only take a
> key/value pair to create file names, whereas what we really want is the
> JobConf so we can get the task number and the total number of tasks. If
> the OutputCollector is also affected by this issue, then we have other
> jobs that we need to set up this kind of check for.
>
> Thanks,
>
> Elton
>
> eptiger@gmail.com
> epinto@alumni.cs.utexas.edu
> http://www.eltonpinto.net/
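As for 1): once close() returns on an FSDataOutputStream, the file should be
visible to other readers, so if every loader task closes its files before
the job completes, a file-count check like the one sketched above should
suffice.

Regarding 3): with the old org.apache.hadoop.mapred API, the JobConf is
handed to your mapper in configure(), so you can derive a task-numbered file
name there. A sketch (field names are mine, not from your code):

// in a mapper extending MapReduceBase
private int taskNumber;
private int totalTasks;

public void configure(JobConf job) {
  taskNumber = job.getInt("mapred.task.partition", -1); // this task's partition number
  totalTasks = job.getNumMapTasks();                    // configured map task count
}

// name output files e.g. "data-" + taskNumber + "-of-" + totalTasks

One caveat: getNumMapTasks() returns the configured number of map tasks,
which may differ from the number of splits the InputFormat actually
produces, so verify it matches your loader's setup before relying on it for
the count check.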