hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Häusler <mich...@akatose.de>
Subject collecting lists in non-natural order / SORT BY on columns which are not part of the final result set
Date Sat, 13 Jun 2015 13:50:22 GMT
Hi there,

imagine you have a table of time-series transactions, e.g.:
CREATE TABLE
    foobar (actor BIGINT, ts BIGINT, action STRING);

containing the rows:
1	2000	bar
1	1000	foo
2	1500	foo

An interesting query would be to get a denormalized view on all actions of an actor sorted
by timestamp:
1	[foo, bar]
2	[foo]

This would require collecting the actions in a list in non-natural order. A naive query would
look like this:
SET hive.map.aggr = false;
SELECT
    actor,
    COLLECT_LIST(action) AS actions
FROM
    foobar
GROUP BY
    actor
SORT BY
    actor, ts;

Unfortunately, as the column "ts" is not in the final result set, this produces a SemanticException
[Error 10004]: Line 9:11 Invalid table alias or column reference 'ts'.


I do understand that it is difficult to allow a global ORDER BY on column that is not in the
result set.
But the SORT BY only needs to ensure order within the reduce group. Is there a way to get
this behaviour in Hive?

Best regards
Michael

Mime
View raw message