Subject: Re: Incremental Data Processing With Hive UDAF
From: buddhika chamith <chamibuddhika@gmail.com>
To: user@hive.apache.org
Date: Thu, 17 Jan 2013 23:00:24 +0530

Hi All,

I would greatly appreciate any feedback on this. Maybe it sounds infeasible; I just wanted to check with the experts. In any case, the problem of incremental data processing is a very interesting one if it can be accommodated.

Best Regards
Buddhika

On Wed, Jan 16, 2013 at 12:36 PM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Hi All,

After digging into the code more, I realized that GroupByOperator can be present on the map side of the computation as well, in which case it performs partial aggregations. In that case the UDAF's terminate() gets called on partial results. However, for the queries I tried, the terminate() methods inside the UDAFs of the reduce-side GroupByOperator finished with fully completed aggregation results, as expected. Can this behaviour be expected for any query? (That is, does the reduce side always compute the fully aggregated result for any aggregation function?)
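
To make the question concrete, this is my current mental model written up as a minimal, untested sketch of a count-like evaluator (the class name is mine; corrections welcome):

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;

// Count-like evaluator skeleton. My understanding of which methods run where:
//   map side (PARTIAL1):     iterate() + terminatePartial()
//   reduce side (FINAL):     merge() + terminate()
//   single stage (COMPLETE): iterate() + terminate()
public class SketchCountEvaluator extends GenericUDAFEvaluator {

  static class CountBuffer implements AggregationBuffer {
    long count;
  }

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    // Both the partial and the final result are longs here.
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    return new CountBuffer();
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    ((CountBuffer) agg).count = 0;
  }

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    ((CountBuffer) agg).count++;  // raw rows, map side
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    return new LongWritable(((CountBuffer) agg).count);  // shipped to the reducers
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    ((CountBuffer) agg).count += ((LongWritable) partial).get();
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    // In FINAL/COMPLETE mode this returns the fully aggregated result; in
    // PARTIAL1/PARTIAL2 mode the GroupByOperator asks for terminatePartial()
    // instead.
    return new LongWritable(((CountBuffer) agg).count);
  }
}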

The problem I am having is that I need a point where the previous aggregation results get merged with the results of the current run. But since terminate() can behave a bit differently depending on whether it runs on the map side or the reduce side, would it make sense to add this logic selectively on the reduce side, based on some configuration property? (The property mapred.task.is.map looks potentially useful here.)
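
Roughly what I have in mind, assuming a Hive version whose GenericUDAFEvaluator exposes configure(MapredContext); loadPreviousResult() is a hypothetical stand-in for the result-cache lookup:

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.LongWritable;

// Extends the sketch above: merge the previously persisted aggregate exactly
// once, on the reduce side only.
public class IncrementalCountEvaluator extends SketchCountEvaluator {

  private boolean isReduceSide = false;

  @Override
  public void configure(MapredContext context) {
    // mapred.task.is.map is set per task by Hadoop. Default to "map" so a
    // missing property can never cause the previous result to be added twice.
    isReduceSide = !context.getJobConf().getBoolean("mapred.task.is.map", true);
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    long current = ((LongWritable) super.terminate(agg)).get();
    if (isReduceSide) {
      current += loadPreviousResult();  // hypothetical result-cache lookup
    }
    return new LongWritable(current);
  }

  // Hypothetical: fetch the aggregate persisted by the previous incremental run.
  private long loadPreviousResult() {
    return 0L;  // placeholder
  }
}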

Also, there needs to be some identifier that uniquely identifies the aggregation UDAF in the operator tree, so that the previous aggregations can be fetched from the result cache using that identifier. Is there a way in which an aggregation function can be uniquely identified within a query?
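
The only workaround I can think of (I am not aware of any built-in identity for an aggregation call) is to pass the key explicitly as a constant argument, e.g. inc_count(col, 'report_q1:agg0'), and capture it in init(). One caveat I can already see: only PARTIAL1/COMPLETE mode sees the original arguments, so the key would still have to reach the reduce side some other way (the job conf, or alongside the partial result). Untested sketch:

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

// Sketch: capture a caller-supplied aggregation key from a constant argument.
public class KeyedCountEvaluator extends SketchCountEvaluator {

  private String aggregationKey;  // identifies this aggregate in the result cache

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    // Only PARTIAL1/COMPLETE see the original arguments; in PARTIAL2/FINAL the
    // single parameter is the partial result, so the constant is absent there.
    if ((m == Mode.PARTIAL1 || m == Mode.COMPLETE)
        && parameters.length > 1
        && parameters[1] instanceof ConstantObjectInspector) {
      aggregationKey = ((ConstantObjectInspector) parameters[1])
          .getWritableConstantValue().toString();
    }
    return super.init(m, parameters);
  }
}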

I realize this might be a long shot, but I am still up for it if it is feasible, albeit with some work. Pointers to any other possible way of achieving this would be highly appreciated.

Regards
Buddhika


On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Any suggestions on this are greatly appreciated. Does anyone see major road blocks?

Regards
Buddhika


On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <chamibuddhika@gmail.com> wrote:
Hi All,

In order to achieve the above, I am researching the feasibility of using a set of custom UDAFs for distributive aggregate operations (e.g. sum, count, etc.). The idea is to incorporate state persisted from earlier aggregations into the current aggregation value inside the UDAF's merge(). For distributing the state data I was thinking of utilizing the Hadoop distributed cache. But I am not sure exactly how UDAFs are executed at runtime. Would putting the logic that adds the persisted state to the current result in terminate() ensure that it is added only once? (Assuming all the aggregations fan in at terminate(); I may have gotten this all wrong. :)) Or is there a better way of achieving the same?
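
For the distribution part, a rough untested sketch of what I mean, assuming the previous run's aggregates were written to a key/value text file registered with DistributedCache.addCacheFile() before the query runs (the file format and class name are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Sketch: load previously persisted aggregates from a distributed-cache file.
// Assumed (made-up) format: one "key<TAB>value" pair per line.
public final class PreviousResultCache {

  public static Map<String, Long> load(JobConf conf) throws IOException {
    Map<String, Long> previous = new HashMap<String, Long>();
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    if (cached == null) {
      return previous;  // first run: nothing persisted yet
    }
    for (Path p : cached) {
      BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          previous.put(parts[0], Long.parseLong(parts[1]));
        }
      } finally {
        reader.close();
      }
    }
    return previous;
  }
}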

Regards
Buddhika


