hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Häusler <mich...@akatose.de>
Subject Re: collecting lists in non-natural order / SORT BY on columns which are not part of the final result set
Date Tue, 16 Jun 2015 16:27:04 GMT
Hi Robin,

thanks a lot for the hint about the WITH statement. This is comparable to using a subselect.
E.g., I could rewrite my query from:
SELECT
    actor,
    COLLECT_LIST(action) AS actions
FROM
    foobar
GROUP BY
    actor
SORT BY
    actor, ts;

to:
SELECT
    actor,
    COLLECT_LIST(action) AS actions
FROM
    (
        SELECT
            actor,
            ts,
            action
        FROM
            foobar
        ORDER BY
            actor,
            ts
    ) foobar_sorted
GROUP BY
    actor;

This has several disadvantages:
a) we have two MR jobs instead of one
b) the global ORDER BY is forcing data to flow through a single reducer, which increases latency
c) when reading the sorted results from the first job, data for the same reduce group can
come from two different input splits. Correct sorting is not guaranteed for these reduce groups.

I created a JIRA for a new UDAF COLLECT_LIST_SORTED:
https://issues.apache.org/jira/browse/HIVE-11022

Best regards
Michael



> On 2015-06-13, at 16:52, Robin Verlangen <robin@us2.nl> wrote:
> 
> Hi Michael,
> 
> You can try using a with statement, pseudo:
> 
> WITH input AS (SELECT colA, colB FROM table ORDER BY colA ASC)
> SELECT colB FROM input
> 
> Best regards, 
> 
> Robin Verlangen
> Chief Data Architect
> 
> W http://www.robinverlangen.nl
> E robin@us2.nl
> 
> 
> What is CloudPelican?
> 
> Disclaimer: The information contained in this message and attachments is intended solely
for the attention and use of the named addressee and may be confidential. If you are not the
intended recipient, you are reminded that the information remains the property of the sender.
You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received
this message in error, please contact the sender immediately and irrevocably delete this message
and any copies.
> 
> On Sat, Jun 13, 2015 at 3:50 PM, Michael Häusler <michael@akatose.de> wrote:
> Hi there,
> 
> imagine you have a table of time-series transactions, e.g.:
> CREATE TABLE
>     foobar (actor BIGINT, ts BIGINT, action STRING);
> 
> containing the rows:
> 1       2000    bar
> 1       1000    foo
> 2       1500    foo
> 
> An interesting query would be to get a denormalized view on all actions of an actor sorted
by timestamp:
> 1       [foo, bar]
> 2       [foo]
> 
> This would require collecting the actions in a list in non-natural order. A naive query
would look like this:
> SET hive.map.aggr = false;
> SELECT
>     actor,
>     COLLECT_LIST(action) AS actions
> FROM
>     foobar
> GROUP BY
>     actor
> SORT BY
>     actor, ts;
> 
> Unfortunately, as the column "ts" is not in the final result set, this produces a SemanticException
[Error 10004]: Line 9:11 Invalid table alias or column reference 'ts'.
> 
> 
> I do understand that it is difficult to allow a global ORDER BY on column that is not
in the result set.
> But the SORT BY only needs to ensure order within the reduce group. Is there a way to
get this behaviour in Hive?
> 
> Best regards
> Michael
> 


Mime
View raw message