hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Storey <storey.j...@gmail.com>
Subject Improving self join time
Date Thu, 20 Mar 2014 14:03:49 GMT
I have a table with 10 million rows and 2 columns - id (int) and element
(string). I am trying to do a self join that finds any ids where the
element values are the same, and my query looks like:

select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN
elements e2 on e1.element = e2.element WHERE e1.id < e2.id;

I tested this at a smaller scale and it works well. The problem is that
with 10 million rows, this becomes a bit large and I've let it run for 90
minutes and it was up to 80GB of disk space and still going. The original
input data was only 500MB.

Is this something I can optimize in hive? Or should I be considering a
different approach to the problem instead?

Any guidance here would be helpful. Thank you.

Mime
View raw message