hive-user mailing list archives

From Premal Shah <>
Subject DISTRIBUTE BY to distribute joins or maybe something else?
Date Thu, 16 Oct 2014 08:21:38 GMT
I have 2 tables that are inner joined. The join keys have low
cardinality and high volume within each key, i.e. key1 on both sides
sometimes has millions of rows.

The join happens in the reduce stage, and the tasks take forever
since there are too many rows per key to join.
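
A toy sketch of the effect described above, in Python rather than Hive. It
assumes the reduce-side join partitions rows by a hash of the join key; the
hash function, key names, row counts and reducer count are all made up for
illustration and are not Hive's actual partitioner or real data.

```python
def reducer_loads(rows_per_key, num_reducers):
    """Return the number of rows each reducer receives when rows are
    partitioned by join key (toy hash: sum of key bytes mod reducers)."""
    loads = [0] * num_reducers
    for key, row_count in rows_per_key.items():
        # Every row with the same join key lands on the same reducer.
        loads[sum(key.encode()) % num_reducers] += row_count
    return loads


# Hypothetical skew: 3 distinct join keys with a million rows each.
rows_per_key = {"key1": 1_000_000, "key2": 1_000_000, "key3": 1_000_000}
loads = reducer_loads(rows_per_key, num_reducers=100)

busy = sum(1 for n in loads if n > 0)
print(f"busy reducers: {busy} of {len(loads)}, heaviest load: {max(loads)} rows")
```

With so few distinct keys, asking for more reducers cannot spread the work
in this model: at most one reducer per key ever receives rows.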

I tried DISTRIBUTE BY on another field, assuming that the data would
get partitioned on that field for the left table (effectively using more
reducers) while the table on the right side is streamed. But that does not
seem to work. Is DISTRIBUTE BY the wrong thing to use for this use case?

Is there any other way to partition the tables so that the joins are faster?

Premal Shah.
