From: Harsh J
Date: Fri, 2 Nov 2012 23:17:27 +0530
Subject: Re: OutputFormat and Reduce Task
To: user@hadoop.apache.org

Yes, only once per task attempt.

On Fri, Nov 2, 2012 at 11:05 PM, Dhruv wrote:
> Thanks Harsh, just to be clear--if I have a large key set and I run with
> just one reducer, which is the default, the OutputFormat and the
> RecordWriter will be constructed only once?
>
> On Thu, Nov 1, 2012 at 8:14 PM, Harsh J wrote:
>>
>> Hi Dhruv,
>>
>> Inline.
>>
>> On Fri, Nov 2, 2012 at 4:15 AM, Dhruv wrote:
>> > I'm trying to optimize the performance of my OutputFormat's
>> > implementation. I'm doing things similar to HBase's
>> > TableOutputFormat--sending the reducer's output to a distributed k-v
>> > store. So, the context.write() call basically winds up doing a Put()
>> > on the store.
>> >
>> > Although I haven't profiled, a sequence of thread dumps on the reduce
>> > tasks reveals that the threads are RUNNABLE and hanging out in the
>> > put() and its subsequent method calls. So, I proceeded to decouple
>> > these two by implementing the producer (context.write()) / consumer
>> > (RecordWriter.write()) pattern using ExecutorService.
>>
>> With HBase involved, this is only partly correct. The HTable API,
>> which the regular TableOutputFormat uses, provides an "AutoFlush"
>> option which, if disabled, begins to buffer writes to regionservers
>> instead of doing a flush of Puts/Deletes at every single invocation.
>>
>> The TableOutputFormat by default does disable AutoFlush, to provide
>> this behavior.
>>
>> Read more on that at
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean,%20boolean)
>> and/or in Lars' book, "HBase: The Definitive Guide".
>>
>> > My understanding is that Context.write() calls RecordWriter.write()
>> > and that these two are synchronous calls. The first will block until
>> > the second method completes. Each reduce phase blocks until the
>> > context.write() finishes, so the next reduce on the next key also
>> > blocks, making things run slow in my case. Is this correct?
>>
>> Given the above explanation, this is untrue if HBase's
>> TableOutputFormat is involved, but true otherwise for general
>> FS-interacting OutputFormats.
>>
>> > Does this mean that OutputFormat is instantiated once by the
>> > TaskTracker for the Job's reduce logic and all keys operated on by
>> > the reducers get the same instance of the OutputFormat? Or is it
>> > that for each key operated on by the reducer, a new OutputFormat is
>> > instantiated?
>>
>> The TaskTracker is a service daemon that does not execute any
>> user code. Only a single OutputFormat object is instantiated in a
>> single Task. The RecordWriter wrapped in it, too, is only instantiated
>> once per Task.
>>
>> > Thanks,
>> > Dhruv
>>
>> --
>> Harsh J
>

--
Harsh J
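
P.S. For a store that has no AutoFlush-style option, the same buffering idea can be sketched in plain Java. This is only an illustration under stated assumptions: BatchingWriter and its sink callback are hypothetical names, not part of Hadoop or HBase, and the sink stands in for whatever bulk-put call the store offers. Because the RecordWriter is created once per task attempt, a buffer held inside it survives across keys and only needs a final flush in close().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical batching writer (not a Hadoop or HBase class).
 * It collects records and hands them to the sink in batches,
 * instead of making one store call per record.
 */
final class BatchingWriter<T> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumed stand-in for a bulk put on the store
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one record; calls the sink only when the buffer is full. */
    void write(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is currently buffered as a single batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
    }

    /** Flushes the remainder; a RecordWriter's close() would do the same. */
    @Override
    public void close() {
        flush();
    }
}
```

A custom RecordWriter could delegate write() and close() to an object like this. The sketch is single-threaded and does no retry or error handling, so it is a starting point rather than a drop-in replacement for the ExecutorService approach.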