hive-dev mailing list archives

From Ning Zhang <>
Subject Re: Work around for using OR in Joins
Date Wed, 23 Mar 2011 04:58:51 GMT
Hive does not currently support joins with OR conditions. Even if you rewrite the
condition to use only NOT and AND, I think the results may be wrong.
Joins with OR conditions between arbitrary tables are quite hard to implement in a MapReduce
framework. They are straightforward to implement as a nested-loop join, but because of the
distributed nature of the processing, a nested-loop join cannot be implemented in an efficient
and scalable way in MapReduce. In a nested-loop join, each mapper needs to join a split of the
LHS table with the whole RHS table, which could be terabytes.
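
To make the cost concrete, here is a minimal sketch of what a single mapper would do in a
nested-loop join. It is not Hive code: the function name, the row layout, and the `id` /
`alt_id` columns are all hypothetical, chosen only to illustrate an OR predicate.

```python
def nested_loop_join(lhs_split, rhs_table, predicate):
    """Join one LHS split against the WHOLE RHS table using any predicate."""
    out = []
    for l in lhs_split:          # rows of this mapper's split
        for r in rhs_table:      # every RHS row must be visited for every LHS row
            if predicate(l, r):
                out.append((l, r))
    return out


# Hypothetical toy data; in practice rhs_table could be terabytes.
lhs_split = [{"id": 1, "alt_id": 10}, {"id": 2, "alt_id": 20}]
rhs_table = [{"id": 1, "alt_id": 99}, {"id": 7, "alt_id": 20}, {"id": 8, "alt_id": 30}]


def either_key_matches(l, r):
    # An OR join condition: match on id OR on alt_id.
    return l["id"] == r["id"] or l["alt_id"] == r["alt_id"]


joined = nested_loop_join(lhs_split, rhs_table, either_key_matches)

# The number of predicate evaluations is |LHS split| * |RHS table|.
comparisons = len(lhs_split) * len(rhs_table)
```

Any predicate works here, but every mapper has to scan the entire RHS table, which is the
scalability problem described above.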

The regular (reduce-side) join in Hive is essentially a sort-merge join operator. With that
in mind, it's hard to implement OR conditions in the sort-merge join. 
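
As a rough illustration of why, here is a simplified single-process sort-merge equi-join.
It is only a sketch, not Hive's operator; the function name and the `id` / `alt_id` columns
are hypothetical. The point is that both inputs are sorted on one join key, so only rows
that are equal on that key ever meet during the merge.

```python
from operator import itemgetter


def sort_merge_join(lhs, rhs, key):
    """Equi-join two lists of dict rows on a single key via sort + merge."""
    lhs = sorted(lhs, key=itemgetter(key))
    rhs = sorted(rhs, key=itemgetter(key))
    out = []
    i = j = 0
    while i < len(lhs) and j < len(rhs):
        lk, rk = lhs[i][key], rhs[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of RHS rows sharing this key value.
            j_end = j
            while j_end < len(rhs) and rhs[j_end][key] == lk:
                j_end += 1
            # Pair every LHS row with this key against that run.
            while i < len(lhs) and lhs[i][key] == lk:
                for r in rhs[j:j_end]:
                    out.append((lhs[i], r))
                i += 1
            j = j_end
    return out


# Hypothetical toy data.
lhs = [{"id": 1, "alt_id": 10}, {"id": 2, "alt_id": 20}]
rhs = [{"id": 1, "alt_id": 99}, {"id": 7, "alt_id": 20}, {"id": 8, "alt_id": 30}]

by_id = sort_merge_join(lhs, rhs, "id")       # finds only the id match
by_alt = sort_merge_join(lhs, rhs, "alt_id")  # finds only the alt_id match
```

A condition such as "id matches OR alt_id matches" needs both sets of pairs, but a single
sort order brings together only the rows that agree on the one key it was sorted by.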

One exception is the map-side join, which assumes the RHS table is small and will be read
fully into each mapper. Currently, the map-side join in Hive is a hash-based join operator. You
could implement a nested-loop map-side join operator to enable any join conditions including

On Mar 22, 2011, at 1:39 AM, MIS wrote:

> Found it at  ** line
> no. 1122
> There is some concern mentioned that supporting OR would lead to data
> explosion. Is it discussed/documented in a little more detail somewhere? If
> so, some pointers towards the same will be helpful.
> Thanks,
> MIS.
> On Tue, Mar 22, 2011 at 1:19 PM, MIS <> wrote:
>> I want to use OR in the join expression, but it seems only AND is supported
>> as of now.
>> I have a workaround though, using De Morgan's law {C1 OR C2 = !(!C1 AND
>> !C2)}, but it would be nice if somebody can point me to the location in
>> code base that would need modification to support the OR in the join
>> expression.
>> Thanks,
>> MIS.
