hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: Hive Left Join inequality condition
Date Sat, 05 Nov 2016 19:48:33 GMT
Ugly as hell, but should work.

Dudu



SELECT r_id,
       CASE WHEN table1.property_value = 'False' THEN FALSE
            WHEN table1.property_value = 'True' THEN TRUE
            WHEN r.rea <  rg.laa THEN FALSE
            WHEN r.rea >= rg.laa THEN TRUE
            ELSE FALSE END AS flag
  FROM rs r
  LEFT JOIN public.di_re rg
    ON r.re = rg.re
  LEFT JOIN (select    r.r_id, table1.property_value
            from    rs r
                    join public.tbl table1
                    ON r.re = table1.re
            where  table1.property_name = 'xxxx'
               AND r.rea BETWEEN table1.begin_time AND table1.end_time
            ) table1

   on r.r_id = table1.r_id

From: Goden Yao [mailto:godenyao@apache.org]
Sent: Saturday, November 05, 2016 9:22 AM
To: user@hive.apache.org
Subject: Hive Left Join inequality condition


Hello!

Lately we have ran into the need to implement inequality JOIN in Hive, and we could have easily
done that with WHERE clause, if it was not the LEFT join.
Basically, we wonder how people implement LEFT/RIGHT JOIN with inequality conditions in Hive
without loss of efficiency.
Thank you.
Example:

SELECT r_id,

       CASE WHEN table1.property_value = 'False' THEN FALSE

            WHEN table1.property_value = 'True' THEN TRUE

            WHEN r.rea <  rg.laa THEN FALSE

            WHEN r.rea >= rg.laa THEN TRUE

            ELSE FALSE END AS flag

  FROM rs r

  LEFT JOIN public.di_re rg

    ON r.re<http://r.re> = rg.re<http://rg.re>

  LEFT JOIN public.tbl table1

    ON r.re<http://r.re> = table1.re<http://table1.re>

   AND table1.property_name = 'xxxx'

   AND r.rea BETWEEN table1.begin_time AND table1.end_time

Error:

FAILED: SemanticException Line 0:-1 Both left and right aliases encountered in JOIN ...

Ways to resolve:
·         Move inequality condition in WHERE clause:

·     WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

·     WARNING: Affects query logic - filters all the table instead of filtering LEFT JOIN
clause only;
·         Move condition into SELECT field with CASE statement (if possible):

·     SELECT r_id,

·          CASE WHEN table1.property_value = 'False'

·                    AND r.rea BETWEEN table1.begin_time AND  table1.end_time THEN FALSE

·               WHEN table1.property_value = 'True'

·                    AND r.rea BETWEEN table1.begin_time AND table1.end_time THEN TRUE
Not possible in every case;
·         Divide queries into two separate statements and UNION them: one query with WHERE
filter and another query totally omitting the JOIN to table that needed inequality as well
as omitting the ids from the first query:

·     WITH stage AS (

·     SELECT r_id,

·          CASE WHEN table1.property_value = 'False' THEN FALSE

·               WHEN table1.property_value = 'True' THEN TRUE

·               WHEN r.rea <  rg.laa THEN FALSE

·               WHEN r.rea >= rg.laa THEN TRUE

·               ELSE FALSE END as flag

·     FROM rs r

·     LEFT JOIN public.di_re rg

·       ON r.re<http://r.re> = rg.re<http://rg.re>

·     LEFT JOIN public.tbl table1

·       ON r.region = table1.region

·      AND table1.property_name = 'xxxx'

·     WHERE r.rea BETWEEN table1.begin_time AND table1.end_time

·     )

·     SELECT * FROM stage

·     UNION

·     SELECT r_id,

·          CASE WHEN r.rea <  rg.laa THEN FALSE

·               WHEN r.rea >= rg.laa THEN TRUE

·               ELSE FALSE END as flag

·     FROM rs r

·     LEFT JOIN public.di_re rg

·       ON r.re<http://r.re> = rg.re<http://rg.re>

·     WHERE r_id NOT IN (SELECT DISTINCT r_id from stage)
Very expensive in terms of calculation, but in some cases inevitable.
​
Mime
View raw message