hive-user mailing list archives

From Furcy Pin <>
Subject Help needed: Out of memory with windowing functions
Date Wed, 20 Aug 2014 12:34:52 GMT
Hi all,

I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.

My query goes like this :

SELECT user_id, event
FROM (
  SELECT user_id, event,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
  FROM eventTable
) T
WHERE rownum <= 10

However, the table may contain millions of events for the same user and I'm
getting an OutOfMemory error in the reduce phase, inside the following method:


It seems that the windowing functions were designed to store a buffer
containing all results for each "PARTITION", and to write everything out
only once all rows of that partition have been read.

This makes windowing with Hive not very scalable...
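
To make clear what I mean by "streaming", here is a minimal Python sketch
of the idea. It is not Hive code and says nothing about Hive's internals;
the function name first_n_events is made up for the illustration, and it
assumes timestamps are numeric. It keeps at most n rows per user in memory,
however many events that user has:

```python
import heapq
from collections import defaultdict


def first_n_events(rows, n=10):
    """Return the n earliest events per user from (user_id, time, event) rows.

    Hypothetical helper for illustration only. Rows may arrive in any order;
    at most n entries are retained per user at any moment.
    """
    # user_id -> heap of (-time, -seq, event); the heap root is always the
    # retained entry with the LATEST time, i.e. the first one to evict.
    heaps = defaultdict(list)
    for seq, (user_id, time, event) in enumerate(rows):
        heap = heaps[user_id]
        # seq breaks ties on time (earlier-read row wins) and is unique,
        # so the event payload itself is never compared.
        item = (-time, -seq, event)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New row is earlier than the latest retained row: swap it in.
            heapq.heapreplace(heap, item)
    # Sorting descending on -time yields ascending time order.
    return {
        user: [event for _, _, event in sorted(heap, reverse=True)]
        for user, heap in heaps.items()
    }
```

In a reducer, rows already arrive grouped by user, so a single heap of
size n would be enough at any one time; the dictionary above is only there
to keep the sketch self-contained.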

My questions are:

a) Is there a reason why it was implemented this way rather than in a
"streaming" fashion?

b) Do you know how I could rewrite the query to avoid the problem (if
possible without having to write my own UDF)?


