hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: OutputFormat and Reduce Task
Date Fri, 02 Nov 2012 17:47:27 GMT
Yes, only once per task attempt.

On Fri, Nov 2, 2012 at 11:05 PM, Dhruv <dhruv21@gmail.com> wrote:
> Thanks Harsh, just to be clear--if I have a large key set and if I run with
> just one reducer which is the default, the OutputFormat and the RecordWriter
> will be constructed only once?
>
>
>
>
> On Thu, Nov 1, 2012 at 8:14 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Hi Dhruv,
>>
>> Inline.
>>
>> On Fri, Nov 2, 2012 at 4:15 AM, Dhruv <dhruv21@gmail.com> wrote:
>> > I'm trying to optimize the performance of my OutputFormat's
>> > implementation.
>> > I'm doing things similar to HBase's TableOutputFormat--sending the
>> > reducer's
>> > output to a distributed k-v store. So, the context.write() call
>> > basically
>> > winds up doing a Put() on the store.
>> >
>> > Although I haven't profiled, a sequence of thread dumps on the reduce
>> > tasks
>> > reveals that the threads are RUNNABLE and hanging out in the put() and
>> > its
>> > subsequent method calls. So, I proceeded to decouple these two by
>> > implementing the producer (context.write()) consumer
>> > (RecordWriter.write())
>> > pattern using ExecutorService.
>>
>> With HBase involved, this is only partly correct. The HTable API,
>> which the regular TableOutputFormat uses, provides an "AutoFlush" option
>> which, if disabled, buffers writes on the client instead of flushing
>> Puts/Deletes to the regionservers on every single invocation.
>>
>> The TableOutputFormat by default does disable AutoFlush, to provide
>> this behavior.
>>
>> Read more on that at
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
>> and/or in Lars' book, "HBase: The Definitive Guide".
>>
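To make the buffering idea concrete, here is a minimal, self-contained Java
sketch of client-side write buffering. It is not HBase code: KvStore,
putBatch, BufferingWriter and the count-based maxBuffered threshold are
hypothetical names and simplifications chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a remote key-value store client; not an HBase API.
interface KvStore {
    void putBatch(List<String[]> batch); // one round trip for many puts
}

// Buffers puts on the client and sends them in batches, similar in spirit to
// running HTable with AutoFlush disabled.
class BufferingWriter implements AutoCloseable {
    private final KvStore store;
    private final int maxBuffered; // assumed threshold, counted in records
    private final List<String[]> buffer = new ArrayList<>();

    BufferingWriter(KvStore store, int maxBuffered) {
        this.store = store;
        this.maxBuffered = maxBuffered;
    }

    // Called once per record; returns quickly unless the buffer is full.
    void write(String key, String value) {
        buffer.add(new String[] {key, value});
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    // Sends whatever is buffered as a single batch.
    void flush() {
        if (!buffer.isEmpty()) {
            store.putBatch(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    // Flush leftovers when the task finishes, as a RecordWriter's close() would.
    @Override
    public void close() {
        flush();
    }
}
```

With a buffer like this, most write() calls only append to a local list, so
the caller blocks on the store only when a batch is actually sent.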
>> > My understanding is that Context.write() calls RecordWriter.write() and
>> > that
>> > these two are synchronous calls. The first will block until the second
>> > method completes. Each reduce phase blocks until the context.write()
>> > finishes, so the next reduce on the next key also blocks, making things
>> > run
>> > slow in my case. Is this correct?
>>
>> Given the above explanation, this is untrue if HBase's
>> TableOutputFormat is involved, but true otherwise for general
>> FS-interacting OutputFormats.
>>
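For an output path that really is synchronous, the producer/consumer
decoupling described in the question could look roughly like the sketch
below. This is an assumption-laden illustration, not the poster's code and
not a Hadoop API: AsyncWriter and slowPut are hypothetical names, and the
one-minute drain timeout is arbitrary.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Hands each write to a single background thread so the caller (the reduce
// loop) does not wait for the slow store call to finish.
class AsyncWriter implements AutoCloseable {
    private final ExecutorService consumer = Executors.newSingleThreadExecutor();
    private final BiConsumer<String, String> slowPut; // hypothetical store call

    AsyncWriter(BiConsumer<String, String> slowPut) {
        this.slowPut = slowPut;
    }

    // Producer side: enqueue the put and return immediately.
    void write(String key, String value) {
        consumer.submit(() -> slowPut.accept(key, value));
    }

    // Drain the queue before the task ends; otherwise records could be lost.
    @Override
    public void close() {
        consumer.shutdown();
        try {
            if (!consumer.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pending writes did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining writes", e);
        }
    }
}
```

A single consumer thread keeps writes in submission order, but its queue is
unbounded, so a fast reducer can pile up records in memory; a bounded queue
would add back-pressure. Exceptions thrown by the store call end up in the
discarded Future here, so a real writer would need to surface them.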
>> > Does this mean that OutputFormat is
>> > instantiated once by the TaskTracker for the Job's reduce logic and all
>> > keys
>> > operated on by the reducers get the same instance of the OutputFormat?
>> > Or,
>> > is it that for each key operated on by the reducer, a new OutputFormat is
>> > instantiated?
>>
>> The TaskTracker is a service daemon that does not execute any
>> user-code. Only a single OutputFormat object is instantiated in a
>> single Task. The RecordWriter wrapped in it too is only instantiated
>> once per Task.
>>
>> > Thanks,
>> > Dhruv
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
