hive-user mailing list archives

From John Sichi <>
Subject Re: implementing moving average as a UDF
Date Tue, 22 Feb 2011 22:58:56 GMT
Yes, your query makes sense and should already work as expected.  The idea of HIVE-1994 is
that once the new annotation is available, we'll make a guarantee that your query as written
below will continue to work in the face of any new optimizer changes (with the downside being
that in some cases you won't be able to take advantage of such optimizer changes).

Each mapper or reducer gets its own instance of the UDF, so (a) you don't have to worry about
any unwanted sharing between them and (b) you have to make sure that your DISTRIBUTE/SORT
clauses are present and correct (Hive won't know anything about the dependency).
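
To illustrate the point above, here is a minimal sketch of the per-instance state such a UDF would keep. It is plain Java with no Hive dependency: the class name MovingAverageSketch and the evaluate signature are hypothetical, and a real UDF would need to wrap this logic in Hive's UDF interface. It assumes rows arrive grouped by key and already sorted, which is exactly what the DISTRIBUTE BY / SORT BY clauses have to guarantee.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

// Hypothetical sketch of the state a stateful moving-average UDF keeps.
// Not Hive API: no Hive classes are used here.
class MovingAverageSketch {
    private String currentKey = null;                      // key of the group being processed
    private final Deque<Double> window = new ArrayDeque<>(); // last `size` values for that key
    private double sum = 0.0;                              // running sum of the window

    // Returns the average of the most recent `size` values seen for `key`.
    // Assumes all rows for a key arrive consecutively and in the desired order.
    double evaluate(String key, double value, int size) {
        if (!Objects.equals(key, currentKey)) {
            // A new group has started: discard the previous group's state.
            currentKey = key;
            window.clear();
            sum = 0.0;
        }
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            // Drop the oldest value so the window never exceeds `size` entries.
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

Because the state is reset only when the key changes, rows for one product that arrive interleaved with another product's rows would silently produce wrong averages, which is why the DISTRIBUTE/SORT clauses must be correct.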

Long term, an implementation of the SQL/OLAP frameworks would be preferable since it would
allow Hive to fully understand the semantics and apply all relevant validations and optimizations
transparently, but in the meantime, stateful UDFs will be the duct tape.


On Feb 22, 2011, at 11:55 AM, Igor Tatarinov wrote:

> Thank you, John.
> It's not quite clear from the page whether my solution:
> 1. makes sense
> 2. works now
> 3. will work in the future if the issue is resolved/implemented
> Could you elaborate?
> Also, there is no mention of UDF object sharing (between mappers) in the current implementation.
> Is this a problem? Do I need to use ThreadLocal or something like that?
> On Tue, Feb 22, 2011 at 11:42 AM, John Sichi <> wrote:
> Please see the discussion in this JIRA issue:
> On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:
> > I would like to implement the moving average as a UDF (instead of a streaming reducer).
> > Here is what I am thinking. Please let me know if I am missing something here:
> >
> > SELECT product, date, mavg(product, price, 10)
> > FROM (
> >   SELECT *
> >   FROM prices
> >   DISTRIBUTE BY product
> >   SORT BY product, date
> > )
> >
> > I have to pass the key to mavg() because it has to detect when one product grouping
> > ends and another starts.
> >
> > Unfortunately, mavg will also need to maintain a state (moving sum and count). That's
> > where I am worried that Hive (Hadoop?) will use a single instance of my UDF to process concurrent
> > groupings and this idea won't work.
> >
> > Is that the main issue? Is there something I can do to fix that?
> >
> > Thanks!
> > igor
> >
