hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Date Wed, 29 Aug 2012 16:05:39 GMT
Hi Tony,

On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TBurton@sportingindex.com> wrote:
> Success so far!
>
> I followed the example given by Tom on the link to the MultipleOutputs.html API you suggested.
>
> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the output depending
> on word length: output to directory "sml" for less than 10 characters, "med" for between 10
> and 20 characters, "lrg" otherwise.
>
> I used out.write(key, new IntWritable(sum), generateFileName(key, sum)); to write the
> output, and generateFileName to create the custom directory name/filename. You need to provide
> the start of the filename as well, otherwise your output files will be -r-00000, -r-00001 etc.
> (so, for example, return "sml/part"; etc)

Thanks for these notes; they should come in handy for those who search!
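
For readers arriving from a search, here is a minimal sketch of the kind of
naming helper Tony describes. The class name SegmentNames, the String
parameter, and the decision to treat exactly 20 characters as "med" are
assumptions made for illustration; Tony's own generateFileName(key, sum) was
not posted.

```java
public class SegmentNames {

    // Returns a base output path of the form "<dir>/part". MultipleOutputs
    // appends the "-r-00000" style suffix itself, so the "part" prefix must
    // be supplied here, as noted in the quoted text above.
    public static String generateFileName(String word) {
        int length = word.length();
        if (length < 10) {
            return "sml/part";   // fewer than 10 characters
        }
        if (length <= 20) {
            return "med/part";   // 10 to 20 characters (upper bound is an assumption)
        }
        return "lrg/part";       // everything longer
    }

    public static void main(String[] args) {
        String[] samples = {"hadoop", "mapreduces", "a-very-long-hyphenated-token"};
        for (String word : samples) {
            System.out.println(word + " -> " + generateFileName(word));
        }
    }
}
```

Here "hadoop" maps to "sml/part", "mapreduces" (10 characters) to "med/part",
and the 28-character token to "lrg/part".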

> Also required: as Tom states, override Reducer.setup() to create the MultipleOutputs.
> However, Tom's puzzle left for the reader is that you also need to override Reducer.cleanup()
> and call close() on your MultipleOutputs object. Forget to do this and your segmented files
> will be empty.

Ah yes, this is important. If the files are not closed, readers would
have to wait about an hour (the open-writer lease expiry period) for
the data to become available.
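
The setup/write/cleanup lifecycle discussed above could look roughly like the
following sketch against the Hadoop 1.0.3 new API. The class name
SegmentingReducer and the inlined naming helper are hypothetical, the
20-character upper bound for "med" is an assumption, and the code has not
been run against a cluster.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SegmentingReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> out;

    @Override
    protected void setup(Context context) {
        // Create the MultipleOutputs once per reduce task.
        out = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // The third argument is the base output path; a "/" in it becomes
        // a subdirectory under the job's output directory.
        out.write(key, new IntWritable(sum), generateFileName(key, sum));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Without this close() the segmented files may end up empty.
        out.close();
    }

    // Hypothetical helper: picks the directory from the word length.
    // The sum is accepted to match the call quoted in the thread but is unused.
    private static String generateFileName(Text key, int sum) {
        int length = key.toString().length();
        String dir = length < 10 ? "sml" : (length <= 20 ? "med" : "lrg");
        return dir + "/part";
    }
}
```

The job driver would register this with job.setReducerClass(SegmentingReducer.class)
and still needs an output path set through FileOutputFormat.setOutputPath, since
the base output paths are resolved relative to it.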

> One observation: although it's not the end of the world, as well as my segmented output
> I also get a zero-size part-r-00000 file in the base of my output path. Is there any way to
> prevent creation of this file?

Set the OutputFormat to NullOutputFormat.

In case you face issues doing this in the new API (you may notice some
odd behavior), try extending NullOutputFormat and overriding its
getOutputCommitter method, i.e.
http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
to return a FileOutputCommitter object. By default it returns a no-op
OutputCommitter that may not gel well with a file-based writer such as
MultipleOutputs. Then set this new OutputFormat as your job's output
format.
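
A minimal, untested sketch of that subclass follows. The class name
CommittingNullOutputFormat is hypothetical, and the committer is built from
the job's configured output path, which is an assumption about how the job
is set up. One caveat worth verifying against your Hadoop version: the
three-argument MultipleOutputs.write(key, value, baseOutputPath) appears to
obtain its record writer from the job's configured output format, so this
approach may suit named outputs (registered through
MultipleOutputs.addNamedOutput with their own format class) better than the
unnamed write call.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CommittingNullOutputFormat<K, V> extends NullOutputFormat<K, V> {

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context) {
        try {
            // Replace the default no-op committer with a file-based one so
            // that files written under the task's work directory are
            // promoted to the job output directory on commit.
            return new FileOutputCommitter(
                    FileOutputFormat.getOutputPath(context), context);
        } catch (IOException e) {
            // Wrapped so the override does not need to declare a checked
            // exception, whatever the parent method declares.
            throw new RuntimeException("Could not create FileOutputCommitter", e);
        }
    }
}
```

In the driver this would be selected with
job.setOutputFormatClass(CommittingNullOutputFormat.class), with the output
path still set through FileOutputFormat.setOutputPath.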

> Thanks again Harsh for pointing the way.
>
> Tony
>
> -----Original Message-----
> From: Tony Burton [mailto:TBurton@SportingIndex.com]
> Sent: 29 August 2012 11:38
> To: user@hadoop.apache.org
> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Thanks Harsh! Will try it out and report back later.
>
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: 29 August 2012 11:12
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony,
>
> Seeing your new question, I recalled Tom's post to a user once, here:
> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
>
> This specific call allows you to specify / characters in your name,
> which get translated into directories that are created automatically:
> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
> (The last argument is where you will need to specify the path)
>
> Try it out and let us know!
>
> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TBurton@sportingindex.com> wrote:
>> Hi Harsh
>>
>> Thanks for the reply - my understanding is that with MultipleOutputs I can write
>> differently named files into the same target directory. With MultipleTextOutputFormat I was
>> able to override the target directory name to perform the segmentation, by overriding
>> generateFileNameForKeyValue().
>>
>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target directory
>> name as well as the file name?
>>
>> Thanks,
>>
>> Tony
>>
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: 28 August 2012 13:44
>> To: user@hadoop.apache.org
>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>
>> The Multiple*OutputFormat have been deprecated in favor of the generic
>> MultipleOutputs API. Would using that instead work for you?
>>
>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <TBurton@sportingindex.com> wrote:
>>> Hi,
>>>
>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good
>>> for writing results into (for example) different directories created on the fly. However,
>>> now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see that the new API no longer
>>> supports MultipleTextOutputFormat. Is there an equivalent that I can use, or will it be supported
>>> in a future release?
>>>
>>> Thanks,
>>>
>>> Tony
>>>
>>>
>>> **********************************************************************
>>> This email and any attachments are confidential, protected by copyright and may
>>> be legally privileged.  If you are not the intended recipient, then the dissemination or copying
>>> of this email is prohibited. If you have received this in error, please notify the sender
>>> by replying by email and then delete the email completely from your system.  Neither Sporting
>>> Index nor the sender accepts responsibility for any virus, or any other defect which might
>>> affect any computer or IT system into which the email is received and/or opened.  It is the
>>> responsibility of the recipient to scan the email and no responsibility is accepted for any
>>> loss or damage arising in any way from receipt or use of this email.  Sporting Index Ltd is
>>> a company registered in England and Wales with company number 2636842, whose registered office
>>> is at Gateway House, Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised
>>> and regulated by the UK Financial Services Authority (reg. no. 150404) and Gambling Commission
>>> (reg. no. 000-027343-R-308898-001).  Any financial promotion contained herein has been issued
>>> and approved by Sporting Index Ltd.
>>>
>>> Outbound email has been scanned for viruses and SPAM
>>>
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



-- 
Harsh J
