hadoop-user mailing list archives

From Alejandro Abdelnur <t...@cloudera.com>
Subject Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Date Fri, 08 Feb 2013 19:06:04 GMT
Tony, I think the first step would be to verify whether the S3 filesystem
implementation's rename works as expected.

Thx


On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <TBurton@sportingindex.com> wrote:

> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
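One way to compute each destination in a listing like the one above is to drop the temp-folder segment from the source location. The sketch below only does the string mapping and does not touch S3; the bucket, folder and type names in it are hypothetical placeholders, and the move itself would still have to go through FileSystem.rename or another client.

```java
/** Maps a part file under the temp folder to its final location by dropping the temp segment. */
final class TempPathMapper {

    /**
     * dataRoot  - common prefix, e.g. "s3://bucket/data" (hypothetical)
     * tmpFolder - name of the temp folder directly under dataRoot
     * source    - full location of a part file under the temp folder
     */
    static String finalLocation(String dataRoot, String tmpFolder, String source) {
        String prefix = dataRoot + "/" + tmpFolder + "/";
        if (!source.startsWith(prefix)) {
            throw new IllegalArgumentException("not under temp folder: " + source);
        }
        // Keep "<object type>/part-nnnnn" and re-attach it directly under the data root.
        return dataRoot + "/" + source.substring(prefix.length());
    }

    public static void main(String[] args) {
        // prints s3://bucket/data/typeA/part-00000
        System.out.println(finalLocation(
                "s3://bucket/data", "_tmp", "s3://bucket/data/_tmp/typeA/part-00000"));
    }
}
```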
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
> *From:* Alejandro Abdelnur [mailto:tucu@cloudera.com]
> *Sent:* 31 January 2013 18:45
> *To:* common-user@hadoop.apache.org
>
> *Subject:* Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand, your problem is not with MTOF but with
> wanting to run two jobs using the same output directory: the second job will
> fail because the output dir already exists. My take would be to tweak your
> jobs to use a temp output dir, and move the results to the required (final)
> location upon completion.
>
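The "write to a temp dir, then move into place" idea can be sketched on a local filesystem with java.nio.file. This is only an analogue and not the Hadoop FileSystem API: the class name, the method name and the <final>/<type>/<run date> layout are assumptions for illustration, and a move on S3 may behave differently from a local rename.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Local-filesystem analogue of promoting a job's temp output to its final location. */
final class PromoteTempOutput {

    /** Moves each child directory of tempDir to finalDir/<child name>/<runDate>. */
    static void promote(Path tempDir, Path finalDir, String runDate) throws IOException {
        // List the children first so nothing is moved while the directory is being read.
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(tempDir)) {
            for (Path p : stream) {
                children.add(p);
            }
        }
        for (Path child : children) {
            if (!Files.isDirectory(child)) {
                continue; // only per-type directories are promoted
            }
            Path target = finalDir.resolve(child.getFileName()).resolve(runDate);
            Files.createDirectories(target.getParent()); // the parent must exist before the move
            Files.move(child, target); // throws if the target already exists
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("promote-demo");
        Path typeDir = Files.createDirectories(root.resolve("tmp").resolve("price"));
        Files.write(typeDir.resolve("part-r-00000"), "example\n".getBytes(StandardCharsets.UTF_8));
        promote(root.resolve("tmp"), root.resolve("final"), "20130113");
        System.out.println(Files.exists(
                root.resolve("final").resolve("price").resolve("20130113").resolve("part-r-00000")));
    }
}
```

Because each run lands in its own run-date subdirectory, a later run does not collide with an earlier one, and a failed job leaves the final location untouched.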
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TBurton@sportingindex.com>
> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, new Path("s3://inputbucket/data"));
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
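The run-date manipulation described above might look like the following sketch. It assumes the input folders are not zero-padded (as in 2013/1/13) and that the relative output name ends in "part"; the class name, the method names and the node name "price" are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Builds the daily read location and relative write location from a yyyyMMdd run date. */
final class RunDatePaths {

    // BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20130113.
    private static final DateTimeFormatter RUN_DATE = DateTimeFormatter.BASIC_ISO_DATE;

    /** Returns <inputRoot>/<year>/<month>/<day> without zero padding. */
    static String inputLocation(String inputRoot, String runDate) {
        LocalDate d = LocalDate.parse(runDate, RUN_DATE);
        return inputRoot + "/" + d.getYear() + "/" + d.getMonthValue() + "/" + d.getDayOfMonth();
    }

    /** Returns <nodeName>/<runDate>/part, relative to the job's output path. */
    static String baseOutputPath(String nodeName, String runDate) {
        LocalDate.parse(runDate, RUN_DATE); // rejects malformed run dates early
        return nodeName + "/" + runDate + "/part";
    }

    public static void main(String[] args) {
        // prints s3://inputbucket/data/2013/1/13
        System.out.println(inputLocation("s3://inputbucket/data", "20130113"));
        // prints price/20130113/part
        System.out.println(baseOutputPath("price", "20130113"));
    }
}
```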
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
>
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 31 August 2012 10:47
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TBurton@sportingindex.com>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> > getOutputCommitter in NullOutputFormat as you suggested, I discovered
> > LazyOutputFormat which only writes when it has to, "the output file is
> > created only when the first record is emitted for a given partition" (from
> > "Hadoop: The Definitive Guide").
> >
> > Instead of
> >
> > job.setOutputFormatClass(TextOutputFormat.class);
> >
> > use LazyOutputFormat like this:
> >
> > LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> >
> > So now my unnamed MultipleOutputs are handling the segmented results, and
> > LazyOutputFormat is suppressing the default output. Good job!
> >
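The "created only when the first record is emitted" behaviour can be illustrated with a plain-Java writer that opens its file lazily. This is a conceptual sketch and not Hadoop's LazyOutputFormat; LazyRecordWriter and its tab-separated record format are made up for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Opens its output file only when the first record arrives. */
final class LazyRecordWriter implements AutoCloseable {

    private final Path file;
    private BufferedWriter out; // stays null until the first write

    LazyRecordWriter(Path file) {
        this.file = file;
    }

    void write(String key, int value) throws IOException {
        if (out == null) {
            out = Files.newBufferedWriter(file, StandardCharsets.UTF_8); // file is created here
        }
        out.write(key + "\t" + value);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close(); // nothing to close, and no file on disk, if no record was written
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-demo");
        Path unused = dir.resolve("part-r-00000");
        try (LazyRecordWriter writer = new LazyRecordWriter(unused)) {
            // no records written
        }
        System.out.println(Files.exists(unused)); // prints false
    }
}
```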
> > Tony
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Harsh J [harsh@cloudera.com]
> > Sent: 29 August 2012 17:05
> > To: user@hadoop.apache.org
> > Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >
> > Hi Tony,
> >
> > On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TBurton@sportingindex.com>
> > wrote:
> >> Success so far!
> >>
> >> I followed the example given by Tom on the link to the
> >> MultipleOutputs.html API you suggested.
> >>
> >> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the
> >> output depending on word length: output to directory "sml" for less than 10
> >> characters, "med" for between 10 and 20 characters, "lrg" otherwise.
> >>
> >> I used out.write(key, new IntWritable(sum), generateFilename(key,
> >> sum)); to write the output, and generateFileName to create the custom
> >> directory name/filename. You need to provide the start of the
> >> filename as well otherwise your output files will be -r-00000,
> >> -r-00001 etc. (so, for example, return "sml/part"; etc)
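A helper along the lines described might look like this. It is a sketch only: it takes a plain String rather than Hadoop's Text and IntWritable, and it assumes "between 10 and 20" includes both ends.

```java
/** Chooses a base output path ("<directory>/part") from the length of a word. */
final class WordLengthBuckets {

    static String generateFileName(String word) {
        int length = word.length();
        if (length < 10) {
            return "sml/part";
        }
        if (length <= 20) { // assumption: 10 and 20 both count as "med"
            return "med/part";
        }
        return "lrg/part";
    }

    public static void main(String[] args) {
        System.out.println(generateFileName("hadoop"));     // prints sml/part
        System.out.println(generateFileName("mapreduces")); // 10 characters, prints med/part
    }
}
```

Going by the note above about files otherwise being named -r-00000, a base path of "sml/part" should give files such as sml/part-r-00000.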
> >
> > Thanks for these notes, they should come in handy for those who search!
> >
> >> Also required: as Tom states, override Reducer.setup() to create the
> >> MultipleOutputs. However, Tom's puzzle left for the reader is that you also
> >> need to override Reducer.cleanup() and call close() on your MultipleOutputs
> >> object. Forget to do this and your segmented files will be empty.
> >
> > Ah yes this is important. Non closure of files would have you wait for
> > an hour for data to get available to readers (open writer lease expiry
> > period).
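Why a missing close() leaves empty files can be shown with a rough plain-Java analogue: one buffered writer per base path, created at set-up time and closed together at clean-up time. SegmentedWriters is a made-up class and not the MultipleOutputs API, and the fixed "-r-00000" suffix is an assumption for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Keeps one buffered writer per base output path; records reach disk on close(). */
final class SegmentedWriters implements AutoCloseable {

    private final Path outputDir;
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    SegmentedWriters(Path outputDir) { // plays the role of the set-up step
        this.outputDir = outputDir;
    }

    void write(String key, int value, String baseOutputPath) throws IOException {
        BufferedWriter writer = writers.get(baseOutputPath);
        if (writer == null) {
            Path file = outputDir.resolve(baseOutputPath + "-r-00000");
            Files.createDirectories(file.getParent()); // "/" in the base path becomes a directory
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
            writers.put(baseOutputPath, writer);
        }
        writer.write(key + "\t" + value);
        writer.newLine();
    }

    @Override
    public void close() throws IOException { // plays the role of the clean-up step
        for (BufferedWriter writer : writers.values()) {
            writer.close(); // flushes the buffered records to the file
        }
        writers.clear();
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("segmented-demo");
        try (SegmentedWriters segmented = new SegmentedWriters(out)) {
            segmented.write("cat", 3, "sml/part");
            segmented.write("internationalisation", 1, "med/part");
        }
        System.out.println(Files.readAllLines(out.resolve("sml/part-r-00000"), StandardCharsets.UTF_8));
    }
}
```

In this analogue a small record sits in the writer's buffer until close() is called, so the file exists but has zero length in the meantime.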
> >
> >> One observation: although it's not the end of the world, as well as my
> >> segmented output I also get a zero-size part-r-00000 file in the base of my
> >> output path. Is there any way to prevent creation of this file?
> >
> > Set the OutputFormat to NullOutputFormat.
> >
> > In case you face issues doing this in new API (you may notice some odd
> > behavior) try to extend NullOutputFormat and in its getOutputCommitter
> > method i.e.
> > http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> > return a FileOutputCommitter object. By default it returns a no-op
> > OutputCommitter that may not gel well with a file-based writer such as
> > MultipleOutputs. Then set this new OutputFormat as your job's output
> > format.
> >
> >> Thanks again Harsh for pointing the way.
> >>
> >> Tony
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Tony Burton [mailto:TBurton@SportingIndex.com]
> >> Sent: 29 August 2012 11:38
> >> To: user@hadoop.apache.org
> >> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Thanks Harsh! Will try it out and report back later.
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 29 August 2012 11:12
> >> To: user@hadoop.apache.org
> >> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Hi Tony,
> >>
> >> Seeing your new question, I recalled Tom's post to a user once, here:
> >> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
> >>
> >> This specific call allows you to specify / characters in your name,
> >> which get translated into the creation of directories automatically:
> >> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
> >> (The last argument is where you will need to specify the path)
> >>
> >> Try it out and let us know!
> >>
> >> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TBurton@sportingindex.com>
> >> wrote:
> >>> Hi Harsh
> >>>
> >>> Thanks for the reply - my understanding is that with MultipleOutputs I
> >>> can write differently named files into the same target directory. With
> >>> MultipleTextOutputFormat I was able to override the target directory name
> >>> to perform the segmentation, by overriding generateFileNameForKeyValue().
> >>>
> >>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target
> >>> directory name as well as the file name?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Harsh J [mailto:harsh@cloudera.com]
> >>> Sent: 28 August 2012 13:44
> >>> To: user@hadoop.apache.org
> >>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>>
> >>> The Multiple*OutputFormat have been deprecated in favor of the
> >>> generic MultipleOutputs API. Would using that instead work for you?
> >>>
> >>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <TBurton@sportingindex.com>
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
> >>>> is good for writing results into (for example) different directories
> >>>> created on the fly. However, now I'm implementing a MapReduce job using
> >>>> Hadoop 1.0.3, I see that the new API no longer supports
> >>>> MultipleTextOutputFormat. Is there an equivalent that I can use, or will it
> >>>> be supported in a future release?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tony
> >>>>
> >>>>
> >>>> *******************************************************************
> >>>> *** This email and any attachments are confidential, protected by
> >>>> copyright and may be legally privileged.  If you are not the
> >>>> intended recipient, then the dissemination or copying of this email
> is prohibited. If you have received this in error, please notify the sender
> by replying by email and then delete the email completely from your system.
>  Neither Sporting Index nor the sender accepts responsibility for any
> virus, or any other defect which might affect any computer or IT system
> into which the email is received and/or opened.  It is the responsibility
> of the recipient to scan the email and no responsibility is accepted for
> any loss or damage arising in any way from receipt or use of this email.
>  Sporting Index Ltd is a company registered in England and Wales with
> company number 2636842, whose registered office is at Gateway House,
> Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised and
> regulated by the UK Financial Services Authority (reg. no. 150404) and
> Gambling Commission (reg. no. 000-027343-R-308898-001).  Any financial
> promotion contained herein has been issued and approved by Sporting Index
> Ltd.
> >>>>
> >>>> Outbound email has been scanned for viruses and SPAM
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>
> --
> Alejandro



-- 
Alejandro
