hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Custom FileOutputFormat / RecordWriter
Date Tue, 26 Jul 2011 19:07:19 GMT
Tom,

You can theoretically add any number of named outputs from a single
task, even from within the map() calls (addNamedOutput and
addMultiNamedOutput check for duplicates themselves, so you don't have
to). So yes, you can keep adding outputs and using them per-key, and
given your earlier details of how many there will be, I think
MultipleOutputs would behave just fine with its cache of record writers.
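
A minimal sketch of the per-key idea, assuming the old (org.apache.hadoop.mapred)
API from Hadoop 0.20.x on the classpath. The class name PerKeyOutputMapper, the
named output "data", the helper declareOutputs, and the input types (Text keys,
BytesWritable values) are all made up for illustration, and the keys are assumed
to be alphanumeric already:

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class PerKeyOutputMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, NullWritable, BytesWritable> {

  // Hypothetical named output; the name may contain letters and digits only.
  static final String NAMED_OUTPUT = "data";

  private MultipleOutputs outputs;

  // Hypothetical helper: call from the driver before submitting the job.
  public static void declareOutputs(JobConf conf) {
    MultipleOutputs.addMultiNamedOutput(conf, NAMED_OUTPUT,
        SequenceFileOutputFormat.class, NullWritable.class, BytesWritable.class);
  }

  @Override
  public void configure(JobConf job) {
    outputs = new MultipleOutputs(job);
  }

  @Override
  @SuppressWarnings("unchecked")
  public void map(Text key, BytesWritable value,
      OutputCollector<NullWritable, BytesWritable> unused, Reporter reporter)
      throws IOException {
    // The key becomes the "multi name" part of the output file name;
    // NullWritable as the output key means only the value is written.
    outputs.getCollector(NAMED_OUTPUT, key.toString(), reporter)
        .collect(NullWritable.get(), value);
  }

  @Override
  public void close() throws IOException {
    // Closes the cached record writers.
    outputs.close();
  }
}
```

With this setup the files should come out named along the lines of
data_<key>-m-00000, so the key picks the file but the full name is not
freely controllable.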

Regarding your other question, there are certain restrictions on the
names passed to MultipleOutputs as a named output. Specifically, names
may contain only characters in [A-Za-z0-9], and a "_" is added
automatically if you are using multi-named outputs. These restrictions
may go away in the future (0.23+) to allow for more flexible naming,
however.
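
To illustrate the naming restriction, here is a small plain-Java sketch (no
Hadoop dependency) that checks a candidate name against [A-Za-z0-9] and strips
everything else. The class NamedOutputNames and its methods are hypothetical
helpers, not part of Hadoop:

```java
public class NamedOutputNames {

  /** Returns true if the name is non-empty and uses only [A-Za-z0-9]. */
  public static boolean isValid(String name) {
    if (name == null || name.isEmpty()) {
      return false;
    }
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean alphanumeric = (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9');
      if (!alphanumeric) {
        return false;
      }
    }
    return true;
  }

  /** Drops every character outside [A-Za-z0-9]. */
  public static String sanitize(String key) {
    return key.replaceAll("[^A-Za-z0-9]", "");
  }
}
```

For example, "mykey-someotherstuff.foo" is not a valid name as-is, and
sanitize() turns it into "mykeysomeotherstufffoo". Note that stripping
characters can make two different keys collide, so a real job may need a
different mapping.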

On Tue, Jul 26, 2011 at 9:21 PM, Tom Melendez <tom@supertom.com> wrote:
> Hi Harsh,
>
> Cool, thanks for the details.  For anyone interested, with your tip
> and description I was able to find an example inside the "Hadoop in
> Action" (Chapter 7, p168) book.
>
> Another question, though, it doesn't look like MultipleOutputs will
> let me control the filename in a per-key (per map) manner.  So,
> basically, if my map receives a key of "mykey", I want my file to be
> "mykey-someotherstuff.foo" (this is a binary file).  Am I right about
> this?
>
> Thanks,
>
> Tom
>
> On Tue, Jul 26, 2011 at 1:34 AM, Harsh J <harsh@cloudera.com> wrote:
>> Tom,
>>
>> What I meant to say was that doing this is well supported with
>> existing API/libraries itself:
>>
>> - The class MultipleOutputs supports providing a filename for an
>> output. See MultipleOutputs.addNamedOutput usage [1].
>> - The type 'NullWritable' is a special writable that doesn't do
>> anything. So if it's configured into the above filename addition as a
>> key-type, and you pass NullWritable.get() as the key in every write
>> operation, you will end up just writing the value part of (key,
>> value).
>> - This way you do not have to write a custom OutputFormat for your use-case.
>>
>> [1] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>> (Also available for the new API, depending on which
>> version/distribution of Hadoop you are on)
>>
>> On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez <tom@supertom.com> wrote:
>>> Hi Harsh,
>>>
>>> Thanks for the response.  Unfortunately, I'm not following your response.  :-)
>>>
>>> Could you elaborate a bit?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>> On Mon, Jul 25, 2011 at 2:10 PM, Harsh J <harsh@cloudera.com> wrote:
>>>> You can use MultipleOutputs (or MultipleTextOutputFormat for direct
>>>> key-file mapping, but I'd still prefer the stable MultipleOutputs).
>>>> Your sinking Key can be of NullWritable type, and you can keep passing
>>>> an instance of NullWritable.get() to it in every cycle. This would
>>>> write just the value, while the filenames are added/sourced from the
>>>> key inside the mapper code.
>>>>
>>>> This, if you are not comfortable writing your own code and maintaining
>>>> it, I s'pose. Your approach is correct as well, if the question was
>>>> specifically that.
>>>>
>>>> On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez <tom@supertom.com> wrote:
>>>>> Hi Folks,
>>>>>
>>>>> Just doing a sanity check here.
>>>>>
>>>>> I have a map-only job, which produces a filename for a key and data as
>>>>> a value.  I want to write the value (data) into the key (filename) in
>>>>> the path specified when I run the job.
>>>>>
>>>>> The value (data) doesn't need any formatting, I can just write it to
>>>>> HDFS without modification.
>>>>>
>>>>> So, looking at this link (the Output Formats section):
>>>>>
>>>>> http://developer.yahoo.com/hadoop/tutorial/module5.html
>>>>>
>>>>> Looks like I want to:
>>>>> - create a new output format
>>>>> - override write, tell it not to call writekey as I don't want that written
>>>>> - new getRecordWriter method that uses the key as the filename and
>>>>> calls my outputformat
>>>>>
>>>>> Sound reasonable?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> --
>>>>> ===================
>>>>> Skybox is hiring.
>>>>> http://www.skyboximaging.com/careers/jobs
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>>
>>> --
>>> ===================
>>> Skybox is hiring.
>>> http://www.skyboximaging.com/careers/jobs
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> ===================
> Skybox is hiring.
> http://www.skyboximaging.com/careers/jobs
>



-- 
Harsh J
