From: Ashutosh Chauhan <ashutosh.chauhan@gmail.com>
Date: Wed, 26 May 2010 18:35:34 -0700
Subject: Re: job level output committer in storage handler
To: hive-user@hadoop.apache.org

Thanks everyone for the replies. I think HIVE-1225 is really what I want. For now I can implement PostExecute, since I only need to call the hook at the end of the query, not at the end of each job or task within the query. If I register it through hive-site.xml, though, I believe it will be executed for every query, which is where the complication starts: I want this hook to run only for insert queries, not for all queries.

One workaround is to get the command string from the session, parse it to determine whether it actually is an insert query, and execute the rest of the code only if it is. That looks hacky, so I look forward to HIVE-1225.

Thanks,
Ashutosh
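A minimal sketch of that hook, for illustration only. The exact PostExecute signature, the SessionState.getCmd() accessor, and the hive.exec.post.hooks property name are assumptions based on Hive code of this era and may differ between versions; the actual update of the external system is left as a placeholder.

package example;

import java.util.Set;

import org.apache.hadoop.hive.ql.hooks.PostExecute;
import org.apache.hadoop.hive.ql.hooks.ReadEntity;
import org.apache.hadoop.hive.ql.hooks.WriteEntity;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.security.UserGroupInformation;

/**
 * Illustrative post-execution hook: it runs after every query, but only
 * acts on insert queries by inspecting the command string kept in the
 * session. Registration is assumed to go through hive.exec.post.hooks in
 * hive-site.xml.
 */
public class InsertOnlyPostExecHook implements PostExecute {

  // NOTE: the run(...) signature varies across Hive versions; adjust it
  // to match the PostExecute interface in your build.
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi) throws Exception {
    // The workaround described above: parse the command string and bail
    // out unless this query was an insert.
    String cmd = sess.getCmd();
    if (cmd == null || !cmd.trim().toLowerCase().startsWith("insert")) {
      return;
    }
    // Placeholder: this is where the external system would be told that
    // the insert finished successfully.
    System.out.println("Insert query finished, notifying external system: " + cmd);
  }
}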
On Wed, May 26, 2010 at 10:35, John Sichi wrote:
> I think we'll need to extend the StorageHandler interface so that it can participate in the commit semantics (separate from the handler-independent hooks Ashish mentioned). That was the intention of this followup JIRA issue I logged as part of the HBase integration work:
>
> https://issues.apache.org/jira/browse/HIVE-1225
>
> To add this one, we need to determine what information needs to be passed along to the storage handler now (and how to make it easy to pass along more information as needed without having to change the interface in the future).
>
> JVS
>
> ________________________________________
> From: Ning Zhang [nzhang@facebook.com]
> Sent: Wednesday, May 26, 2010 10:22 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: job level output committer in storage handler
>
> Hi Ashutosh,
>
> Hive doesn't use OutputCommitter explicitly because it handles commit and abort by itself.
>
> If you are looking for a task-level committer, where you want to do something after a task has finished successfully, you can take a look at FileSinkOperator.closeOp(). It renames the temp file to the final file name, which implements the commit semantics.
>
> If you are looking for a job-level committer, where you want to do something after the job (including all tasks) has finished successfully, you can take a look at the MoveTask implementation. A MoveTask is generated as a follow-up task after the MR job for each insert overwrite statement. It moves the directory that contains the results from all finished tasks to its destination path (e.g., a directory specified in the insert statement or inferred from the table's storage location property). The MoveTask implements the commit semantics of the whole job.
>
> Ning
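To make the rename-as-commit pattern described above concrete, here is a simplified sketch against the Hadoop FileSystem API. This is not Hive's actual FileSinkOperator/MoveTask code, only the shape of it: each task promotes its temp file by renaming it, and a single job-level step moves the whole scratch directory to the destination. All paths and names are illustrative.

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCommitSketch {

  // Task-level commit, roughly what FileSinkOperator.closeOp() does:
  // promote this task's temp file to its final name inside the job's
  // scratch directory. A failed task simply never performs the rename.
  public static void commitTask(Configuration conf, Path tmpFile, Path finalFile)
      throws IOException {
    FileSystem fs = tmpFile.getFileSystem(conf);
    if (!fs.rename(tmpFile, finalFile)) {
      throw new IOException("Could not commit " + tmpFile + " to " + finalFile);
    }
  }

  // Job-level commit, roughly what MoveTask does: once every task has
  // succeeded, move the scratch directory holding all committed task
  // outputs to the table/partition destination in one step.
  public static void commitJob(Configuration conf, Path scratchDir, Path destDir)
      throws IOException {
    FileSystem fs = scratchDir.getFileSystem(conf);
    if (fs.exists(destDir)) {
      fs.delete(destDir, true); // insert overwrite semantics
    }
    if (!fs.rename(scratchDir, destDir)) {
      throw new IOException("Could not move " + scratchDir + " to " + destDir);
    }
  }
}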
> On May 26, 2010, at 9:16 AM, Ashutosh Chauhan wrote:
>
>> Hi Kortni,
>>
>> Thanks for your suggestion, but we can't use it in our setup. We are not spinning off Hive jobs in a separate process that we could monitor; instead, I want to get a handle on when the job finishes from within my storage handler / serde.
>>
>> Ashutosh
>>
>> On Tue, May 25, 2010 at 12:25, Kortni Smith wrote:
>>> Hi Ashutosh,
>>>
>>> I'm not sure how to accomplish that on the Hive side of things, but in case it helps: it sounds like you want to know when your job is done so you can update something externally, and my company will also be implementing this in the near future. Our plan is to have the process that kicks off our Hive jobs in the cloud monitor each job's status periodically using Amazon's EMR Java library, and when their state changes to complete, update our external systems accordingly.
>>>
>>> Kortni Smith | Software Developer
>>> AbeBooks.com  Passion for books.
>>>
>>> ksmith@abebooks.com
>>> phone: 250.412.3272  |  fax: 250.475.6014
>>>
>>> Suite 500 - 655 Tyee Rd. Victoria, BC, Canada V9A 6X5
>>>
>>> www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
>>> www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com
>>>
>>> -----Original Message-----
>>> From: Ashutosh Chauhan [mailto:ashutosh.chauhan@gmail.com]
>>> Sent: Tuesday, May 25, 2010 12:13 PM
>>> To: hive-user@hadoop.apache.org
>>> Subject: job level output committer in storage handler
>>>
>>> Hi,
>>>
>>> I am implementing my own serde and storage handler. Is there any method in one of these interfaces (or any other) that gives me a handle to do some operation after all the records have been written by all the reducers, something very similar to a job-level output committer? I want to update some state in an external system once I know the job has completed successfully. Ideally I would do this kind of thing in a job-level output committer, but since Hive is on the old MR API, I don't have access to that. There is Hive's RecordWriter#close(); I tried that, but it looks like it's a task-level handle, so every reducer would try to update the state of my external system, which is not what I want. Any pointers on how to achieve this would be much appreciated. If it's unclear what I am asking for, let me know and I will provide more details.
>>>
>>> Thanks,
>>> Ashutosh
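For completeness, a purely hypothetical illustration of the kind of StorageHandler extension HIVE-1225 proposes, as JVS mentions above. None of these method names exist in Hive; they are invented here only to show the idea of letting a handler participate in job-level commit and abort, and what information gets passed in is exactly the open question noted in the thread.

package example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.ql.metadata.HiveStorageHandler;

/**
 * Hypothetical sketch only, not actual Hive API: what a commit-aware
 * storage handler along the lines of HIVE-1225 might look like.
 */
public interface CommitAwareStorageHandler extends HiveStorageHandler {

  // Invented method: called once, after every task of the job writing to
  // this handler's table has finished successfully (analogous to a
  // job-level output committer).
  void commitInsert(Configuration jobConf, Table table) throws Exception;

  // Invented method: called if the job fails, so any external state can
  // be rolled back.
  void abortInsert(Configuration jobConf, Table table) throws Exception;
}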