hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Furcy Pin <furcy....@flaminem.com>
Subject Re: LEFT JOIN and WHERE CLAUSE - How to handle
Date Wed, 23 Aug 2017 14:25:39 GMT
I would suggest to use a subquery

WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) as CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)

and then perform your left join on it.

Some SQL engine (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a little less costly than a max.
Sometimes, they also have functions like max_by(x, y)
<https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
would allow you to get the most recent description.

It is a shame that this function is not included in Hive yet, but still you
can find some UDAF implementations on github
<https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>.



On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:

> Hi,
>
> Need your suggestion on the below.
>
> Have two tables TXN and CURRENCY.
>
> Need all records in TXN and hence doing Left Join with CURRENCY.
>
> *Two problems :*
> 1. CURRENCY table may contain duplicate records hence it needs to be
> handled through RANK or some other function.
> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
> clause' then we will loose the EUR records which should not happen.
>
> Please suggest a solution to over come both the problems. For duplicated
> records it is fine if we select any of the CNTRY_DESC.
>
> *Table : CURRENCY*
>
> *Table : TXN*
>
> *CCY_CD*
>
> *CNTRY_DESC*
>
> *EFF_ST_DT*
>
> *EFF_END_DT*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> INR
>
> Indian Rupee
>
> 1-Jan-15
>
> 20-Feb-16
>
> 1
>
> INR
>
> 16-Feb-17
>
> INR
>
> Indian Rupee New
>
> 21-Feb-16
>
> 20-Feb-99
>
> 2
>
> USD
>
> 16-Feb-17
>
> USD
>
> US Dollar
>
> 1-Jan-15
>
> 20-Feb-16
>
> 3
>
> SGD
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar
>
> 1-Jan-15
>
> 20-Feb-17
>
> 4
>
> EUR
>
> 16-Feb-17
>
> SGD
>
> Singapore Dollar New
>
> 15-Feb-17
>
> 20-Feb-99
>
> SGD
>
> Singapore Dollar Latest
>
> 16-Feb-17
>
> 16-Feb-17
>
> *Expected Output*
>
> *ROW_NUM*
>
> *CCY_CD*
>
> *TXN_DT*
>
> *CNTRY_DESC*
>
> 1
>
> INR
>
> 16-Feb-17
>
> Indian Rupee
>
> 2
>
> USD
>
> 16-Feb-17
>
> US Dollar
>
> 3
>
> SGD
>
> 16-Feb-17
>
> Singapore Dollar Latest (Any of three valid valid is fine)
>
> 4
>
> EUR
>
> 16-Feb-17
>
> <Null>
>
>
>
> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
> where
> TXN_DT between EFF_ST_DT and EFF_END_DT;
>
>
>
> This query will drop the "EUR" record because of the where clause used.
> It cannot be handled with case statement instead of 'where clause' as we
> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>
> regards,
> Rams
>

Mime
View raw message