Subject: Re: Incremental Data Processing With Hive UDAF
From: buddhika chamith <chamibuddhika@gmail.com>
To: user@hive.apache.org
Date: Thu, 17 Jan 2013 23:00:24 +0530

Hi All,

I would greatly appreciate any feedback on this. Maybe it sounds infeasible; I just wanted to check with the experts. In any case, the problem of incremental data processing is a very interesting one if it can be accommodated.

Best Regards
Buddhika

On Wed, Jan 16, 2013 at 12:36 PM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Hi All,

After digging into the code more, I realized that GroupByOperator can be present on the map side of the computation as well, in which case it performs partial aggregations. In that case the UDAF's terminate() gets called on partial results. However, for the queries I tried, the terminate() methods inside the UDAFs of the reduce-side GroupByOperator finished with fully completed aggregation results, as expected. Can this behaviour be expected for any query? (That is, does the reduce side always compute the fully aggregated result for any aggregation function?)
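
To make the question concrete, this is my current mental model written up as a minimal, untested sketch of a count-like evaluator (the class name is mine; corrections welcome):

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;

// Count-like evaluator skeleton. My understanding of which methods run where:
//   map side (PARTIAL1):     iterate() + terminatePartial()
//   reduce side (FINAL):     merge() + terminate()
//   single stage (COMPLETE): iterate() + terminate()
public class SketchCountEvaluator extends GenericUDAFEvaluator {

  static class CountBuffer implements AggregationBuffer {
    long count;
  }

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    // Both the partial and the final result are longs here.
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    return new CountBuffer();
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    ((CountBuffer) agg).count = 0;
  }

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    ((CountBuffer) agg).count++;  // raw rows, map side
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    return new LongWritable(((CountBuffer) agg).count);  // shipped to the reducers
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    ((CountBuffer) agg).count += ((LongWritable) partial).get();
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    // In FINAL/COMPLETE mode this returns the fully aggregated result; in
    // PARTIAL1/PARTIAL2 mode the GroupByOperator asks for terminatePartial()
    // instead.
    return new LongWritable(((CountBuffer) agg).count);
  }
}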

The problem I am having is that I need a point where the previous aggregation results get merged with the results of the current run. But since terminate() can behave a bit differently depending on whether it runs on the map side or the reduce side, would it make sense to add this logic selectively on the reduce side, based on some configuration property? (The property mapred.task.is.map looks potentially useful here.)
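
Roughly what I have in mind, assuming a Hive version whose GenericUDAFEvaluator exposes configure(MapredContext); loadPreviousResult() is a hypothetical stand-in for the result-cache lookup:

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.LongWritable;

// Extends the sketch above: merge the previously persisted aggregate exactly
// once, on the reduce side only.
public class IncrementalCountEvaluator extends SketchCountEvaluator {

  private boolean isReduceSide = false;

  @Override
  public void configure(MapredContext context) {
    // mapred.task.is.map is set per task by Hadoop. Default to "map" so a
    // missing property can never cause the previous result to be added twice.
    isReduceSide = !context.getJobConf().getBoolean("mapred.task.is.map", true);
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    long current = ((LongWritable) super.terminate(agg)).get();
    if (isReduceSide) {
      current += loadPreviousResult();  // hypothetical result-cache lookup
    }
    return new LongWritable(current);
  }

  // Hypothetical: fetch the aggregate persisted by the previous incremental run.
  private long loadPreviousResult() {
    return 0L;  // placeholder
  }
}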

Also, there needs to be some identifier that uniquely identifies the aggregation UDAF in the operator tree, so that the previous aggregations can be fetched from the result cache using that identifier. Is there a way in which an aggregation function can be uniquely identified within a query?
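
The only workaround I can think of (I am not aware of any built-in identity for an aggregation call) is to pass the key explicitly as a constant argument, e.g. inc_count(col, 'report_q1:agg0'), and capture it in init(). One caveat I can already see: only PARTIAL1/COMPLETE mode sees the original arguments, so the key would still have to reach the reduce side some other way (the job conf, or alongside the partial result). Untested sketch:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

// Sketch: capture a caller-supplied aggregation key from a constant argument.
public class KeyedCountEvaluator extends SketchCountEvaluator {

  private String aggregationKey;  // identifies this aggregate in the result cache

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    // Only PARTIAL1/COMPLETE see the original arguments; in PARTIAL2/FINAL the
    // single parameter is the partial result, so the constant is absent there.
    if ((m == Mode.PARTIAL1 || m == Mode.COMPLETE)
        && parameters.length > 1
        && parameters[1] instanceof ConstantObjectInspector) {
      aggregationKey = ((ConstantObjectInspector) parameters[1])
          .getWritableConstantValue().toString();
    }
    return super.init(m, parameters);
  }
}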

I realize this might be a long shot, but I am still up for it if it is feasible, albeit with some work. Pointers to any other possible way of achieving this would be highly appreciated.

Regards
Buddhika


On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Any suggestions on this are greatly appreciated. Does anyone see major road blocks?

Regards
Buddhika


On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Hi All,

In order to achieve the above, I am researching the feasibility of using a set of custom UDAFs for distributive aggregate operations (e.g. sum, count, etc.). The idea is to incorporate state persisted from earlier aggregations into the current aggregation value inside the UDAF's merge(). For distributing the state data I was thinking of utilizing the Hadoop distributed cache. But I am not sure exactly how UDAFs are executed at runtime. Would putting the logic that adds the persisted state to the current result in terminate() ensure that it is added only once? (Assuming all the aggregations fan in at terminate(); I may have gotten this all wrong. :)) Or is there a better way of achieving the same?
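
For the distribution part, a rough untested sketch of what I mean, assuming the previous run's aggregates were written to a key/value text file registered with DistributedCache.addCacheFile() before the query runs (the file format and class name are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Sketch: load previously persisted aggregates from a distributed-cache file.
// Assumed (made-up) format: one "key<TAB>value" pair per line.
public final class PreviousResultCache {

  public static Map<String, Long> load(JobConf conf) throws IOException {
    Map<String, Long> previous = new HashMap<String, Long>();
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    if (cached == null) {
      return previous;  // first run: nothing persisted yet
    }
    for (Path p : cached) {
      BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          previous.put(parts[0], Long.parseLong(parts[1]));
        }
      } finally {
        reader.close();
      }
    }
    return previous;
  }
}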

Regards
Buddhika


