hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From comptech geeky <comptechge...@gmail.com>
Subject Re: Something wrong with my query to get TOP 3?
Date Thu, 19 Jul 2012 23:05:59 GMT
Hi Igor,

I am new to HiveQL world. Don't know that much basically. Currently I have
my Rank UDF function like this-

*public final class Rank extends UDF{*
*    private int  counter;*
*    private String last_key;*
*    public int evaluate(final String key){*
*  if ( !key.equalsIgnoreCase(this.last_key) ) {*
*     this.counter = 0;*
*     this.last_key = key;*
*  }*
*  return this.counter++;*
*    }*
*}*
*
*
And I tried that query after removing pid from distribute by and sort by
clause, but I got the below output which is wrong again-

*1345653 330760137950    0*
*1345653 330760137950    1*
*1345653 330760137950    2*

But I need output something like this-

*1345653    330760137950       2012-07-09 21:43:29*
*1345653    330760137950       2012-07-09 21:42:29*
*1345653    330760137950       2012-07-09 21:41:29*
1345653    110909316904       2012-07-09 21:29:06
1345653    221065796761       2012-07-09 19:31:48


Any help will be appreciated.




On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <igor@decide.com> wrote:

> Actually, never mind. Looks like you need to partition by both bid and
> pid. In that case, your problem is that rank() has to handle a combined
> bid+pid key. So first you need to create a combined key, partition by that
> key and pass it to your rank() function (assuming rank() knows to reset on
> a new key). You can cast bid and pid to string and concatenate them with a
> separator (bid_pid) to get a single partitioning key. Hope this makes sense.
>
> On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <igor@decide.com> wrote:
>
>> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
>> Your very first query was correct except for the nested subquery part.
>> (You don't need a double-nested subquery.)
>>
>> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <comptechgeeky@gmail.com>wrote:
>>
>>> Hi Igor,
>>>
>>> I am not sure what I have to remove from Distribute By as in distribute
>>> by we have bid, pid and you said remove bid and time from distribute by and
>>> it doesn't have time
>>>
>>> *SELECT bid, pid, rank FROM *
>>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>>> FROM
>>> *
>>>   *
>>>            ( SELECT bid, pid, time FROM table1*
>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>> '2012-07-09' *
>>> *            DISTRIBUTE BY bid,pid*
>>> *            SORT BY bid,pid, time desc) A
>>> *
>>> *
>>>       ) B
>>> *
>>> *WHERE rank < 3;*
>>>
>>>
>>> And also I tried running the above query as it is. I am not getting
>>> expected output instead of that I am getting output like this which is
>>> wrong If you compare my expected output with the below output-
>>>
>>>  *1345653 110909316904    0*
>>> *1345653 221065796761    1*
>>> *1345653 330760137950    2*
>>>
>>>
>>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <igor@decide.com> wrote:
>>>
>>>> Remove pid,time from DISTRIBUTE BY.
>>>>
>>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <
>>>> comptechgeeky@gmail.com> wrote:
>>>>
>>>>> Modified Query that I wrote and its not working as expected output is.
>>>>>
>>>>> *
>>>>> *
>>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>>> *FROM (*
>>>>> *    SELECT bid, pid, time*
>>>>> *    FROM table1*
>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>>> '2012-07-09'*
>>>>> *    DISTRIBUTE BY bid,pid,time*
>>>>> *    SORT BY bid, time desc*
>>>>> *) a*
>>>>> *WHERE rank(bid) < 3;*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>>> comptechgeeky@gmail.com> wrote:
>>>>>
>>>>>> I wrote this query after modifying it-
>>>>>>
>>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>>> UNIX_TIMESTAMP(created_time)*
>>>>>> *FROM (*
>>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>>> *    FROM testingtable1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>> int))) = '2012-07-09'*
>>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>> *) a*
>>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>>
>>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>>> rows-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>
>>>>>> These two rows are missing-
>>>>>>
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>>  So full output should be like this-
>>>>>>
>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>>> comptechgeeky@gmail.com> wrote:
>>>>>>
>>>>>>> Can you show me the exact query that I need to do for this
>>>>>>> particular problem consideing my scenario? It will be of great
help to me.
>>>>>>> As I am new to HiveQL.
>>>>>>>
>>>>>>> I need TOP 3 for those if BID and PID gets matched but with
>>>>>>> different timestamp.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>>> philip.j.tromans@gmail.com> wrote:
>>>>>>>
>>>>>>>> Your rank() is being evaluated map side. Put your distribute
by and
>>>>>>>> sort by in an inner query, and then evaluate your rank()
in an outer query.
>>>>>>>>
>>>>>>>> Phil.
>>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechgeeky@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is the below data in my Table1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BID       PID                       TIME
>>>>>>>>>
>>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>>
>>>>>>>>> So If I need to clarify the above scenario- I have data
in above
>>>>>>>>> table like this-
>>>>>>>>> For USER *`1345653` *I have this PID `*330760137950`
*four times
>>>>>>>>> but with different timestamps in red color. So I need
the output something
>>>>>>>>> like this-
>>>>>>>>>
>>>>>>>>> Output that I need:-
>>>>>>>>>
>>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>>
>>>>>>>>> So Basically If BID and PID are same but with different
>>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending
order
>>>>>>>>>
>>>>>>>>> And for this I created rank UDF (User Defined Function).
And I
>>>>>>>>> wrote the below query but its not working for me. Can
anyone help me on
>>>>>>>>> this?
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>>> *    FROM table1*
>>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time)
>>>>>>>>> as int))) = '2012-07-09'*
>>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message