hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Furcy Pin <furcy....@flaminem.com>
Subject Re: LEFT JOIN and WHERE CLAUSE - How to handle
Date Wed, 23 Aug 2017 15:47:10 GMT
Ho, in that case...

(First I notice that you say you want all records in TXN but in the query
you give, you perform your join the other way round.)

This is a typical use case that SQL is not very good at handling...

The solutions I see are:

- use RANK as you suggested.
  Note that Hive is smart enough to optimize it correctly :
  if you use a filter WHERE Rank() < K, it will take the K first values in
RAM and ignore the rest, rather than ranking everything and filtering
afterwards.

- perform a GROUP BY TXN.ROW_NUM so you have only one line per transaction.
Aggregate the rest the way you like (max, max_by or arbitrary)
  To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
to your where clause (and reverse the left join).

- Most complicated way to implement, but simplest to use afterwards:
  replace your CURRENCY table with a table where you have only one row per
currency, and store the multiple start/end dates in an array of structs,
  then a UDF that given such array and a date returns the correct
description. It only works if you don't have too many description per
currency though.
  This is where Spark comes handy: it is simpler to define UDFs with it.



On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
ramasubramanian.narayanan@gmail.com> wrote:

> Hi,
>
> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
> It needs to be equated.
>
>
> regards,
> Rams
>
> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy.pin@flaminem.com> wrote:
>
>> I would suggest to use a subquery
>>
>> WITH unique_currency AS (
>>   SELECT
>>     CCY_CD,
>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>   FROM CURRENCY
>>   GROUP BY CCY_CD
>> )
>>
>> and then perform your left join on it.
>>
>> Some SQL engine (e.g. Presto) have aggregation functions like
>> arbitrary(col) that take any value and are a little less costly than a
>> max.
>> Sometimes, they also have functions like max_by(x, y)
>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>> would allow you to get the most recent description.
>>
>> It is a shame that this function is not included in Hive yet, but still
>> you can find some UDAF implementations on github
>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>> .
>>
>>
>>
>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>> ramasubramanian.narayanan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Need your suggestion on the below.
>>>
>>> Have two tables TXN and CURRENCY.
>>>
>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>
>>> *Two problems :*
>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>> handled through RANK or some other function.
>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>> clause' then we will loose the EUR records which should not happen.
>>>
>>> Please suggest a solution to over come both the problems. For duplicated
>>> records it is fine if we select any of the CNTRY_DESC.
>>>
>>> *Table : CURRENCY*
>>>
>>> *Table : TXN*
>>>
>>> *CCY_CD*
>>>
>>> *CNTRY_DESC*
>>>
>>> *EFF_ST_DT*
>>>
>>> *EFF_END_DT*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> INR
>>>
>>> Indian Rupee
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> INR
>>>
>>> Indian Rupee New
>>>
>>> 21-Feb-16
>>>
>>> 20-Feb-99
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> USD
>>>
>>> US Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-16
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar
>>>
>>> 1-Jan-15
>>>
>>> 20-Feb-17
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> SGD
>>>
>>> Singapore Dollar New
>>>
>>> 15-Feb-17
>>>
>>> 20-Feb-99
>>>
>>> SGD
>>>
>>> Singapore Dollar Latest
>>>
>>> 16-Feb-17
>>>
>>> 16-Feb-17
>>>
>>> *Expected Output*
>>>
>>> *ROW_NUM*
>>>
>>> *CCY_CD*
>>>
>>> *TXN_DT*
>>>
>>> *CNTRY_DESC*
>>>
>>> 1
>>>
>>> INR
>>>
>>> 16-Feb-17
>>>
>>> Indian Rupee
>>>
>>> 2
>>>
>>> USD
>>>
>>> 16-Feb-17
>>>
>>> US Dollar
>>>
>>> 3
>>>
>>> SGD
>>>
>>> 16-Feb-17
>>>
>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>
>>> 4
>>>
>>> EUR
>>>
>>> 16-Feb-17
>>>
>>> <Null>
>>>
>>>
>>>
>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>> where
>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>
>>>
>>>
>>> This query will drop the "EUR" record because of the where clause used.
>>> It cannot be handled with case statement instead of 'where clause' as we
>>> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>
>>> regards,
>>> Rams
>>>
>>
>>
>

Mime
View raw message