hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Tez reducer parallelism ..
Date Wed, 16 Mar 2016 21:39:21 GMT
> So you're saying, since these windows are part of a single SELECT
>projection they need to be serial?

Yes, with a full shuffle of the result so far for each new OVER().

>              row_number() OVER( PARTITION BY app, user, type ORDER BY ts ) as a_number,
>              row_number() OVER( PARTITION BY day, app, user, type ORDER BY ts ) as type_rank,
>              row_number() OVER( PARTITION BY day, app, user ORDER BY ts ) as dau_rank,

You can write your own stateful windowing UDAF which does this over a
subset window, provided your PARTITION BY is pretty wide.

compute_app_dau(app, day, user, type, ts)
    OVER(PARTITION BY app, user ORDER BY day, type, ts) as a_number
compute_type_dau(app, day, user, type, ts)
    OVER(PARTITION BY app, user ORDER BY day, type, ts) as type_rank

etc.

It's up to your impl to remember the last row & take advantage of the
ordering to produce ranking along timestamps - Hive will distribute by
app, user & sort by app, user, day, type, ts & feed into your UDAF.
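
To make the "remember the last row" idea concrete, here is a minimal sketch
in plain Python, not written against Hive's UDAF interfaces. It only
simulates what Hive would do (distribute by app, user and sort by day, type,
ts) and then walks each partition once. The names rank_partition and
simulate, the dict-based row layout and the sample rows are all hypothetical.
It also assumes day is derived from ts, so the rows of one type arrive in ts
order across days.

```python
from collections import defaultdict


def rank_partition(rows):
    """Rank the rows of ONE (app, user) partition in a single pass.

    `rows` must already be sorted by (day, type, ts), which is the order
    the sketch assumes the engine feeds into the function.
    """
    per_type_count = defaultdict(int)  # running count per type, across days
    last_key = None                    # (day, type) of the previous row
    type_rank = 0                      # running count inside one (day, type)
    out = []
    for row in rows:
        key = (row["day"], row["type"])
        if key != last_key:
            # New (day, type) group: restart the per-day, per-type rank.
            type_rank = 0
            last_key = key
        type_rank += 1
        per_type_count[row["type"]] += 1
        out.append({**row,
                    "a_number": per_type_count[row["type"]],
                    "type_rank": type_rank})
    return out


def simulate(rows):
    """Stand-in for the shuffle: group by (app, user), sort, then rank."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["app"], row["user"])].append(row)
    result = []
    for part_key in sorted(partitions):
        ordered = sorted(partitions[part_key],
                         key=lambda r: (r["day"], r["type"], r["ts"]))
        result.extend(rank_partition(ordered))
    return result


# Hypothetical sample data: one app, one user, two days.
sample = [
    {"app": "a", "user": "u1", "day": 1, "type": "click", "ts": 10},
    {"app": "a", "user": "u1", "day": 1, "type": "view", "ts": 5},
    {"app": "a", "user": "u1", "day": 1, "type": "click", "ts": 20},
    {"app": "a", "user": "u1", "day": 2, "type": "click", "ts": 110},
    {"app": "a", "user": "u1", "day": 2, "type": "view", "ts": 105},
]

for ranked_row in simulate(sample):
    print(ranked_row)
```

A real implementation would presumably keep the same state (the per-type
counters plus the previous row's day and type) in the UDAF's aggregation
buffer. dau_rank is left out of the sketch on purpose: with this sort order
the rows of one day arrive grouped by type, not in ts order across types, so
it would likely need a day's rows buffered or a different ordering.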

If this is mostly done at the user+app tuple, the real skew should only be
the biggest user X app combo.

I'm guessing it can even be a single UDAF returning a Struct with all
numbers in one go, but I haven't dug deeper into
ISupportStreamingModeForWindowing yet.

Cheers,
Gopal


