hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ning Zhang <nzh...@fb.com>
Subject Re: Work around for using OR in Joins
Date Wed, 23 Mar 2011 07:24:17 GMT
What is your OR conditions? Are they involving both tables? Can you post your query here?

On Mar 23, 2011, at 12:04 AM, MIS wrote:

> Ning, Thanks for the reply.
> Yes. you are right. Using NOT and AND didn't work as expected.
> I'll give a try in implementing nested-loop map-side join.
> 
> In the meanwhile, I brought out the expression using OR from inside the JOIN
> expression to be used in the filtering expression {in the WHERE clause },
> but I see some difference in the results produced with what is expected.
> Since,  I'm not using an OUTER join, I expected desired results. Any idea
> why is the OR expression in filtering not working as desired ? Any thought
> on this are welcome.
> 
> Thanks,
> MIS.
> 
> 
> On Wed, Mar 23, 2011 at 10:28 AM, Ning Zhang <nzhang@fb.com> wrote:
> 
>> Joins with OR conditions are not supported by Hive currently. I think even
>> though you rewrite the condition to use NOT and AND only, the results may be
>> wrong.
>> 
>> It is quite hard to implement joins of any tables with OR conditions in a
>> MapReduce framework. it is straightforward to implement it in nested-loop
>> join, but due to the nature of distributed processing, nested loop join
>> cannot be implemented in an efficient and scalable way in MapReduce. In
>> nested-loop join, each mapper need to join a split of LHS table with the
>> whole RHS table which could be terabytes.
>> 
>> The regular (reduce-side) join in Hive is essentially a sort-merge join
>> operator. With that in mind, it's hard to implement OR conditions in the
>> sort-merge join.
>> 
>> One exception is the map-side join, which assumes the RHS table is small
>> and will be read fully into each mapper. Currently map-side join in Hive is
>> a hash-based join operator. You can implement a nested-loop map-side join
>> operator to enable any join conditions including OR.
>> 
>> On Mar 22, 2011, at 1:39 AM, MIS wrote:
>> 
>>> Found it at  *org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.java* line
>>> no. 1122
>>> There is some concern mentioned that supporting OR would lead to data
>>> explosion. Is it discussed/documneted in a little more detail somewhere ?
>> If
>>> so, some pointers towards the same will be helpful.
>>> 
>>> Thanks,
>>> MIS.
>>> 
>>> On Tue, Mar 22, 2011 at 1:19 PM, MIS <misapache@gmail.com> wrote:
>>> 
>>>> I want to use OR in the join expression, but it seems only AND is
>> supported
>>>> as of now.
>>>> I have a work around though to use DeMorgan's law {C1 OR C2 = !(!C1 AND
>>>> !C2))} , but it would be nice if somebody can point me to the location
>> in
>>>> code base that would need modification to support the OR in the join
>>>> expression.
>>>> 
>>>> Thanks,
>>>> MIS.
>>>> 
>> 
>> 


Mime
View raw message