hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Praveen Kumar K J V S <praveenkjvs.develo...@gmail.com>
Subject Re: Hive Join with distinct rows
Date Fri, 09 Nov 2012 17:03:18 GMT
Thank you very much Mark

Yes query1 is doing just fine, but using query1 I will not be able to get
the data in other columns in table T1



On Fri, Nov 9, 2012 at 10:04 PM, Mark Grover <grover.markgrover@gmail.com>wrote:

> I see. I re-read your first email and you would like to query "select all
> the unique ID's in T1 which are not in T2"
>
> Query 1 seems to be doing just fine so I would say that's the way to go. I
> personally use "IS" operator when comparing something with NULLs instead of
> "=".
>
> There are some optimizations you can read about like semi joins that might
> come in handy for this query or queries in the future.
>
> https://cwiki.apache.org/Hive/languagemanual-joins.html
>
> Mark
>
>
>
> On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S <
> praveenkjvs.developer@gmail.com> wrote:
>
>> Thanks Mark, I do understand that how Hive works with Distinct keyword.
>>
>> What I was looking for is a solution for my requirement in Hive, I am not
>> an expert in SQL, hence looking for suggestions
>>
>>
>> On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover <grover.markgrover@gmail.com>wrote:
>>
>>> Hi Praveen,
>>> Let's take an example:
>>> (from
>>> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses
>>> )
>>>
>>> -- Print out contents of the table
>>> hive> SELECT col1, col2 FROM t1;
>>> 1 3
>>> 1 3
>>> 1 4
>>> 2 5
>>>
>>> -- Selects distinct col1, col2 tuple
>>> hive> SELECT DISTINCT col1, col2 FROM t1;
>>> 1 3
>>> 1 4
>>> 2 5
>>>
>>>
>>> Similar to the second query above, your Query 2 selects each of the
>>> distinct values for <id, url, timestamp> tuple possibly giving you multiple
>>> records for a given id on the left side of the join. Consequently you don't
>>> get the result you expect.
>>>
>>> Mark
>>>
>>> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S <
>>> praveenkjvs.developer@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
>>>>
>>>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence
>>>> only unique values are present.
>>>>
>>>> I want to join both tables T1 & T2 such that select all the unique ID's
>>>> in T1 which are not in  T2
>>>>
>>>> I have written 2 queries:
>>>>
>>>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND
>>>> T2.ID=NULL
>>>>
>>>> this gives me expected results, but with the below I am getting all the
>>>> rows in T1
>>>>
>>>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT
>>>> OUTER JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
>>>>
>>>> Can some one point me how to achieve: select all the unique ID's in T1
>>>> which are not in T2
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>
>>
>

Mime
View raw message