hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From MIS <misapa...@gmail.com>
Subject Re: Work around for using OR in Joins
Date Wed, 23 Mar 2011 07:04:56 GMT
Ning, Thanks for the reply.
Yes. you are right. Using NOT and AND didn't work as expected.
I'll give a try in implementing nested-loop map-side join.

In the meanwhile, I brought out the expression using OR from inside the JOIN
expression to be used in the filtering expression {in the WHERE clause },
but I see some difference in the results produced with what is expected.
Since,  I'm not using an OUTER join, I expected desired results. Any idea
why is the OR expression in filtering not working as desired ? Any thought
on this are welcome.

Thanks,
MIS.


On Wed, Mar 23, 2011 at 10:28 AM, Ning Zhang <nzhang@fb.com> wrote:

> Joins with OR conditions are not supported by Hive currently. I think even
> though you rewrite the condition to use NOT and AND only, the results may be
> wrong.
>
> It is quite hard to implement joins of any tables with OR conditions in a
> MapReduce framework. it is straightforward to implement it in nested-loop
> join, but due to the nature of distributed processing, nested loop join
> cannot be implemented in an efficient and scalable way in MapReduce. In
> nested-loop join, each mapper need to join a split of LHS table with the
> whole RHS table which could be terabytes.
>
> The regular (reduce-side) join in Hive is essentially a sort-merge join
> operator. With that in mind, it's hard to implement OR conditions in the
> sort-merge join.
>
> One exception is the map-side join, which assumes the RHS table is small
> and will be read fully into each mapper. Currently map-side join in Hive is
> a hash-based join operator. You can implement a nested-loop map-side join
> operator to enable any join conditions including OR.
>
> On Mar 22, 2011, at 1:39 AM, MIS wrote:
>
> > Found it at  *org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.java* line
> > no. 1122
> > There is some concern mentioned that supporting OR would lead to data
> > explosion. Is it discussed/documneted in a little more detail somewhere ?
> If
> > so, some pointers towards the same will be helpful.
> >
> > Thanks,
> > MIS.
> >
> > On Tue, Mar 22, 2011 at 1:19 PM, MIS <misapache@gmail.com> wrote:
> >
> >> I want to use OR in the join expression, but it seems only AND is
> supported
> >> as of now.
> >> I have a work around though to use DeMorgan's law {C1 OR C2 = !(!C1 AND
> >> !C2))} , but it would be nice if somebody can point me to the location
> in
> >> code base that would need modification to support the OR in the join
> >> expression.
> >>
> >> Thanks,
> >> MIS.
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message