hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Goden Yao <goden...@apache.org>
Subject Hive Left Join inequality condition
Date Sat, 05 Nov 2016 07:22:04 GMT
Hello!


Lately we have ran into the need to implement inequality JOIN in Hive, and
we could have easily done that with WHERE clause, if it was not the LEFT
join.
Basically, we wonder how people implement LEFT/RIGHT JOIN with inequality
conditions in Hive without loss of efficiency.
Thank you.
Example:

SELECT r_id,
       CASE WHEN table1.property_value = 'False' THEN FALSE
            WHEN table1.property_value = 'True' THEN TRUE
            WHEN r.rea <  rg.laa THEN FALSE
            WHEN r.rea >= rg.laa THEN TRUE
            ELSE FALSE END AS flag
  FROM rs r
  LEFT JOIN public.di_re rg
    ON r.re = rg.re
  LEFT JOIN public.tbl table1
    ON r.re = table1.re
   AND table1.property_name = 'xxxx'
   AND r.rea BETWEEN table1.begin_time AND table1.end_time
Error:
FAILED: SemanticException Line 0:-1 Both left and right aliases
encountered in JOIN ...

Ways to resolve:

   - Move inequality condition in WHERE clause:

   WHERE r.rea BETWEEN table1.begin_time AND table1.end_time
   WARNING: Affects query logic - filters all the table instead of
filtering LEFT JOIN clause only;

   - Move condition into SELECT field with CASE statement (if possible):

   SELECT r_id,
        CASE WHEN table1.property_value = 'False'
                  AND r.rea BETWEEN table1.begin_time AND
table1.end_time THEN FALSE
             WHEN table1.property_value = 'True'
                  AND r.rea BETWEEN table1.begin_time AND
table1.end_time THEN TRUE

   Not possible in every case;
   - Divide queries into two separate statements and UNION them: one query
   with WHERE filter and another query totally omitting the JOIN to table that
   needed inequality as well as omitting the ids from the first query:

   WITH stage AS (
   SELECT r_id,
        CASE WHEN table1.property_value = 'False' THEN FALSE
             WHEN table1.property_value = 'True' THEN TRUE
             WHEN r.rea <  rg.laa THEN FALSE
             WHEN r.rea >= rg.laa THEN TRUE
             ELSE FALSE END as flag
   FROM rs r
   LEFT JOIN public.di_re rg
     ON r.re = rg.re
   LEFT JOIN public.tbl table1
     ON r.region = table1.region
    AND table1.property_name = 'xxxx'
   WHERE r.rea BETWEEN table1.begin_time AND table1.end_time
   )
   SELECT * FROM stage
   UNION
   SELECT r_id,
        CASE WHEN r.rea <  rg.laa THEN FALSE
             WHEN r.rea >= rg.laa THEN TRUE
             ELSE FALSE END as flag
   FROM rs r
   LEFT JOIN public.di_re rg
     ON r.re = rg.re
   WHERE r_id NOT IN (SELECT DISTINCT r_id from stage)

   Very expensive in terms of calculation, but in some cases inevitable.

‚Äč

Mime
View raw message