From: Alejandro Abdelnur <tucu@cloudera.com>
Date: Fri, 8 Feb 2013 11:06:04 -0800
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
To: common-user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Tony, I think the first step would be to verify that the S3 filesystem implementation's rename works as expected.

Thx


On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <TBurton@sportingindex.com> wrote:

Thanks for the reply Alejandro. Using a temp output directory was my first guess as well. What's the best way to proceed? I've come across FileSystem.rename but it's consistently returning false for whatever Paths I provide. Specifically, I need to copy the following:

s3://<path to data>/<tmp folder>/<object type 1>/part-00000
...
s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
s3://<path to data>/<tmp folder>/<object type 2>/part-00000
...
s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
...
s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn

to

s3://<path to data>/<object type 1>/part-00000
...
s3://<path to data>/<object type 1>/part-nnnnn
s3://<path to data>/<object type 2>/part-00000
...
s3://<path to data>/<object type 2>/part-nnnnn
...
s3://<path to data>/<object type m>/part-nnnnn

without doing a copyToLocal.

Any tips? Are there any better alternatives to FileSystem.rename? Or would using the AWS Java SDK be a better solution?

Thanks!

Tony
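Whichever mechanism ends up doing the move (FileSystem.rename, a copy, or the AWS SDK), each temporary object name has to be mapped to its final name by dropping the temp-folder segment. The following is a minimal sketch of that mapping only, in plain Java with no Hadoop or AWS calls. The class name `TmpPathMapper`, the method `finalPathFor`, and the example bucket, folder and object-type names are all hypothetical.

```java
/** Hypothetical helper: maps a temporary object path to its final location. */
final class TmpPathMapper {
    private TmpPathMapper() {}

    /**
     * Removes the first "/<tmpFolder>/" segment from tmpPath.
     * Throws IllegalArgumentException if the segment is not present.
     */
    static String finalPathFor(String tmpPath, String tmpFolder) {
        String segment = "/" + tmpFolder + "/";
        int idx = tmpPath.indexOf(segment);
        if (idx < 0) {
            throw new IllegalArgumentException(
                "no /" + tmpFolder + "/ segment in " + tmpPath);
        }
        // Keep everything before the segment, then rejoin with a single slash.
        return tmpPath.substring(0, idx) + "/"
            + tmpPath.substring(idx + segment.length());
    }
}
```

For example, `TmpPathMapper.finalPathFor("s3://bucket/data/_tmp/typeA/part-00000", "_tmp")` returns `"s3://bucket/data/typeA/part-00000"`.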

From: Alejandro Abdelnur [mailto:tucu@cloudera.com]
Sent: 31 January 2013 18:45
To: common-user@hadoop.apache.org
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Hi Tony, from what I understand your prob is not with MTOF but with you wanting to run 2 jobs using the same output directory; the second job will fail because the output dir already exists. My take would be tweaking your jobs to use a temp output dir, and moving the output to the required (final) location upon completion.

thx
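The "write to a temp dir, then move on completion" pattern can be sketched on a local filesystem with java.nio.file. This is only an analogue under stated assumptions: it is not the Hadoop FileSystem API and says nothing about how rename behaves on S3. The class name `TempThenMove` and the method `promote` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical local-filesystem analogue of "temp output dir, then move". */
final class TempThenMove {
    private TempThenMove() {}

    /**
     * Moves every entry of tmpDir into finalDir, creating finalDir if needed.
     * Files.move is called without REPLACE_EXISTING, so an entry that already
     * exists in finalDir causes a FileAlreadyExistsException instead of being
     * silently overwritten. Returns the number of entries moved.
     */
    static int promote(Path tmpDir, Path finalDir) throws IOException {
        Files.createDirectories(finalDir);

        // Collect the entries first so the directory is not modified
        // while it is being iterated.
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(tmpDir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }

        for (Path entry : entries) {
            Files.move(entry, finalDir.resolve(entry.getFileName()));
        }
        return entries.size();
    }
}
```

A job driver would call something like `promote` only after the job reports success, so a failed run leaves the final location untouched.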

On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TBurton@sportingindex.com> wrote:

Hi everyone,

Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. Despite the initial success of the discovery, I had to shelve the approach as I ended up using a different solution (for reasons I forget!) in the implementation that was ultimately used for that particular project.

I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through.

I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format:

s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>

I want to extract items from the XML and write out as follows:

s3://outputbucket/path/<xml node name>/20130113/<output from MR job>

For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example

FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));

Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically.
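The run-date manipulation described above can be sketched in plain Java. This is an illustration only, assuming the run date arrives as a yyyyMMdd string. The class `RunDatePaths`, its method names, and the node name `nodeA` are hypothetical and not taken from the original job.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that derives daily read and write locations. */
final class RunDatePaths {
    private RunDatePaths() {}

    /** "20130113" with base "s3://inputbucket/data" gives ".../2013/1/13/". */
    static String inputPath(String inputBase, String runDate) {
        // BASIC_ISO_DATE parses the compact yyyyMMdd form.
        LocalDate date = LocalDate.parse(runDate, DateTimeFormatter.BASIC_ISO_DATE);
        // The input layout uses unpadded month and day numbers.
        return inputBase + "/" + date.getYear()
            + "/" + date.getMonthValue()
            + "/" + date.getDayOfMonth() + "/";
    }

    /** Output layout: <base>/<xml node name>/<yyyyMMdd>/. */
    static String outputPath(String outputBase, String nodeName, String runDate) {
        return outputBase + "/" + nodeName + "/" + runDate + "/";
    }
}
```

With these definitions, `RunDatePaths.inputPath("s3://inputbucket/data", "20130113")` yields `"s3://inputbucket/data/2013/1/13/"`, and `RunDatePaths.outputPath("s3://outputbucket/path", "nodeA", "20130113")` yields `"s3://outputbucket/path/nodeA/20130113/"`.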

However, for the next day's run, the job gets to FileOutputFormat.setOutputPath and sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist.

Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: "public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten*" (my emphasis). Is it possible to deconfigure the overwrite protection?

If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a "temp" directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit "hacky" to me.

Thanks for any feedback!

Tony







________________________________________
From: Harsh J [harsh@cloudera.com]
Sent: 31 August 2012 10:47
To: user@hadoop.apache.org
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use the LazyOutputFormat for the job-level config. Please file a JIRA for this under MAPREDUCE project on the Apache JIRA?

On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TBurton@sportingindex.com> wrote:
> Hi Harsh,
>
> I tried using NullOutputFormat as you suggested, however simply using
>
> job.setOutputFormatClass(NullOutputFormat.class);
>
> resulted in no output at all. Although I've not tried overriding getOutputCommitter in NullOutputFormat as you suggested, I discovered LazyOutputFormat which only writes when it has to, "the output file is created only when the first record is emitted for a given partition" (from "Hadoop: The Definitive Guide").
>
> Instead of
>
> job.setOutputFormatClass(TextOutputFormat.class);
>
> use LazyOutputFormat like this:
>
> LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
>
> So now my unnamed MultipleOutputs are handling the segmented results, and LazyOutputFormat is suppressing the default output. Good job!
>
> Tony
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 29 August 2012 17:05
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony,
>
> On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TBurton@sportingindex.com> wrote:
>> Success so far!
>>
>> I followed the example given by Tom on the link to the MultipleOutputs.html API you suggested.
>>
>> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the output depending on word length: output to directory "sml" for less than 10 characters, "med" for between 10 and 20 characters, "lrg" otherwise.
>>
>> I used out.write(key, new IntWritable(sum), generateFileName(key, sum)); to write the output, and generateFileName to create the custom directory name/filename. You need to provide the start of the filename as well, otherwise your output files will be -r-00000, -r-00001 etc. (so, for example, return "sml/part"; etc)
>
> Thanks for these notes, they should be helpful for those who search!
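The naming function described in the notes above might look roughly like the sketch below. It is not Tony's actual code. It takes a plain String rather than the Hadoop Text key and the sum, and it assumes that words of exactly 10 or exactly 20 characters fall into "med", which the message does not specify. The class name `WordLengthNaming` is hypothetical.

```java
/** Hypothetical stand-in for the generateFileName helper described above. */
final class WordLengthNaming {
    private WordLengthNaming() {}

    /**
     * Returns the base output path for a word: a directory chosen by word
     * length, followed by the "part" file-name prefix. The framework then
     * appends its own suffix such as -r-00000.
     */
    static String generateFileName(String word) {
        int length = word.length();
        if (length < 10) {
            return "sml/part";
        } else if (length <= 20) {
            // Assumption: both boundaries (10 and 20) count as "med".
            return "med/part";
        }
        return "lrg/part";
    }
}
```

In a reducer, the returned string would be passed as the last argument of the MultipleOutputs write call quoted above, so that "sml/part" ends up as files named part-r-nnnnn inside an sml directory.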
>
>> Also required: as Tom states, override Reducer.setup() to create the MultipleOutputs. However, Tom's puzzle left for the reader is that you also need to override Reducer.cleanup() and call close() on your MultipleOutputs object. Forget to do this and your segmented files will be empty.
>
> Ah yes this is important. Non closure of files would have you wait for
> an hour for data to get available to readers (open writer lease expiry
> period).
>
>> One observation: although it's not the end of the world, as well as my segmented output I also get a zero-size part-r-00000 file in the base of my output path. Is there any way to prevent creation of this file?
>
> Set the OutputFormat to NullOutputFormat.
>
> In case you face issues doing this in new API (you may notice some odd
> behavior) try to extend NullOutputFormat and in its getOutputCommitter
> method i.e.
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> return a FileOutputCommitter object. By default it returns a no-op
> OutputCommitter that may not gel well with a file-based writer such as
> MultipleOutputs. Then set this new OutputFormat as your job's output
> format.
>
>> Thanks again Harsh for pointing the way.
>>
>> Tony
>>
>>
>>
>>=
>>
>>
>>
>> -----Original Message-----=
>> From: Tony Burton [mailto:TBurton@SportingIndex.com]
>> Sent: 29 = August 2012 11:38
>> To: user@hadoop.apache.org
>> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat>>
>> Thanks Harsh! Will try it out and report back later.=
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: 29 August 2012 11:12
>> To: user@hadoop.apache.org
>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>
>> Hi Tony,
>>
>> Seeing your new question, I recalled a post Tom once made to a user, here:
>> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
>>
>> This specific call allows you to use / characters in the name, which
>> get translated into directories that are created automatically:
>>
>> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
>> (The last argument is where you will need to specify the path.)
>>
>> Try it out and let us know!
>>
>> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TBurton@sportingindex.com> wrote:
>>> Hi Harsh
>>>
>>> Thanks for the reply - my understanding is that with MultipleOutputs I can write differently named files into the same target directory. With MultipleTextOutputFormat I was able to override the target directory name to perform the segmentation, by overriding generateFileNameForKeyValue().
>>>
>>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target directory name as well as the file name?
>>>
>>> Thanks,
>>>
>>> Tony
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Harsh J [mailto:harsh@cloudera.com]
>>> Sent: 28 August 2012 13:44
>>> To: user@hadoop.apache.org
>>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>>
>>> The Multiple*OutputFormat have been deprecated in favor of the
>>> generic MultipleOutputs API. Would using that instead work for you?
>>>
>>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <TBurton@sportingindex.com> wrote:
>>>> Hi,
>>>>
>>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good for writing results into (for example) different directories created on the fly. However, now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see that the new API no longer supports MultipleTextOutputFormat. Is there an equivalent that I can use, or will it be supported in a future release?
>>>>
>>>> Thanks,
>>>>
>>>> Tony
>>>>
>>>>
>>>> **********************************************************************
>>>> This email and any attachments are confidential, protected by copyright and may be legally privileged. If you are not the intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened. It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email. Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Gateway House, Milverton Street, London, SE11 4AP. Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404) and Gambling Commission (reg. no. 000-027343-R-308898-001). Any financial promotion contained herein has been issued and approved by Sporting Index Ltd.
>>>>
>>>> Outbound email has been scanned for viruses and SPAM
>>>>
>>>
>>> --
>>> Harsh J
>>> www.sportingindex.com
>>> Inbound Email has been scanned for viruses and SPAM
>>
>>
>>
>> --
>> Harsh J
>>
>
>
> --
> Harsh J


--
Harsh J



--
Alejandro






--
Alejandro
--14dae93403a90b0cab04d53b4236--