hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Furcy Pin <furcy....@flaminem.com>
Subject Help needed: Out of memory with windowing functions
Date Wed, 20 Aug 2014 12:34:52 GMT
Hi all,

I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.


My query goes like this :

SELECT user_id, event
FROM
(
SELECT
user_id,
event,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
FROM eventTable
) T
WHERE rownum <= 10

However, the table may contain millions of events for the same user and I'm
getting
an OutOfMemory Error in the reduce phase, inside the following method:

org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:80)



It seems that the windowing functions were designed to store a Buffer
containing all

results for each "PARTITION", and writes everything once all rows of
that partition

have been read.


This make windowing with Hive not very scalable...


My questions are:


a) Is there a reason why it was implemented this way rather than in a
"streaming" fashion?


b) Do you know how I could rewrite the query to avoid the problem (if
possible without having to write my own UDF)?



Thanks,


Furcy

Mime
View raw message