hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wouter de Bie <wou...@spotify.com>
Subject Re: Find TOP 10 using HiveQL
Date Tue, 10 Jul 2012 07:24:16 GMT
You could use TRANSFORM with a simple awk script:

TRANSFORM(a, b, c, d)
  USING "/usr/bin/awk '
  {if($1!=c){c=$1; a=0}; if(a<20){print $0; a++}}'"


This will create a top 20 for each group.  

--Wouter de Bie
Team Lead Analytics Infrastructure, Spotify
wouter@spotify.com (mailto:wouter@spotify.com)
+46 72 018 0777


On Tuesday, July 10, 2012 at 9:15 AM, Raihan Jamal wrote:

> I am trying that solution. Currently I am running my query to see what result I am getting
back with UDF.
>  
>  
>  
>  
> Raihan Jamal  
>  
>  
> On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <nitinpawar432@gmail.com (mailto:nitinpawar432@gmail.com)>
wrote:
> > i thought you managed to solve this with rank??
> >  
> >  
> > On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <jamalraihan@gmail.com (mailto:jamalraihan@gmail.com)>
wrote:
> > > Problem with that approach is, with LIMIT 10, If I am putting after desc, then
it will get only 10 rows irrespective of BUYER_ID. But I need specifically for each BUYER_ID
10 latest rows.
> > >  
> > >  
> > >  
> > >  
> > > Raihan Jamal
> > >  
> > >  
> > >  
> > > On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <abhishektiwari.btech@gmail.com
(mailto:abhishektiwari.btech@gmail.com)> wrote:
> > > > Raihan,
> > > >  
> > > > Andes suggests you use 'limit' after 'desc' ie Hive with sort your query
results in descending order and then chop off any more than first 10 records.  
> > > > However, Hive will still run the full scan on data since it has to find
the sorted list of records for you.   
> > > >  
> > > >  
> > > > On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <jamalraihan@gmail.com
(mailto:jamalraihan@gmail.com)> wrote:
> > > > > This is my below requirement. I need-  Find `TOP 10` data for each
`BUYER_ID, So I cannot use LIMIT 10 here in this case.
> > > > >  
> > > > > This is the below table
> > > > >  
> > > > >     CREATE TABLE IF NOT EXISTS TestingTable1  
> > > > >     (  
> > > > >     BUYER_ID BIGINT,
> > > > >     ITEM_ID BIGINT,  
> > > > >     CREATED_TIME STRING
> > > > >     )
> > > > >  
> > > > > And this is the below data in the above table-  
> > > > >  
> > > > >     BUYER_ID    |    ITEM_ID   | CREATED_TIME
> > > > >     ------------+------------------+-----------------------
> > > > >     1015826235      220003038067        2012-07-09 19:40:21,
> > > > >     1015826235      300003861266        2012-07-09 18:19:59,
> > > > >     1015826235      140002997245        2012-07-09 09:23:17,
> > > > >     1015826235      210002448035        2012-07-09 22:21:11,
> > > > >     1015826235      260003553381        2012-07-09 07:09:56,
> > > > >     1015826235      260003553382        2012-07-09 19:40:39,
> > > > >     1015826235      260003553383        2012-07-09 06:58:47,
> > > > >     1015826235      260003553384        2012-07-09 07:28:47,
> > > > >     1015826235      260003553385        2012-07-09 08:48:47,
> > > > >     1015826235      260003553386        2012-07-09 06:38:47,
> > > > >     1015826235      260003553387        2012-07-09 05:38:47,
> > > > >     1015826235      260003553388        2012-07-09 04:55:47,
> > > > >     1015826235      260003553389        2012-07-09 06:54:37,
> > > > >     34512201        597245693           2012-07-09 16:20:21,
> > > > >     34512201        8071787728          2012-07-09 15:19:59,
> > > > >     34512201        5868222883 (tel:5868222883)          2012-07-09
08:23:17,
> > > > >     34512201        2412180494          2012-07-09 22:21:11,
> > > > >     34512201        2422054205          2012-07-09 06:09:56,
> > > > >     34512201        1875744030          2012-07-09 19:40:39,
> > > > >     34512201        5639158173 (tel:5639158173)          2012-07-09
06:58:47,
> > > > >     34512201        5656232360          2012-07-09 07:28:47,  
> > > > >     34512201        959188449 2012-07-09 08:48:47,
> > > > >     34512201        4645350592          2012-07-09 06:38:47,
> > > > >     34512201        5657320532 2012-07-09 05:38:47,
> > > > >     34512201        290419656539 2012-07-09 04:55:47,
> > > > >  
> > > > > If you see the above data in the table, there are only two UNIQUE
`BUYER_ID` and corresponding to those I have `ITEM_ID` AND `CREATED_TIME`. I need only 10
latest record basis on the time for that given day for each `BUYER_ID`.  
> > > > >  
> > > > > So for this `BUYER_ID` - `34512201` I need 10 latest record basis
on `CREATED_TIME` for that given day only, it means for today's date I need 10 latest record
for each `BUYER_ID`.
> > > > >  
> > > > > And each `BUYER_ID` can have any day's data. But I am specifically
interested for day before  today's data(means yesterday's date always) by checking at the
`CREATED_TIME`
> > > > >  
> > > > > **Find `TOP 10` data for each `BUYER_ID`. Below is the sample output.**
 
> > > > >  
> > > > > Sample Output.
> > > > >  
> > > > >     BUYER_ID    |    ITEM_ID   | CREATED_TIME  
> > > > >     ------------+------------------+-----------------------
> > > > >     34512201        2412180494          2012-07-09 22:21:11
> > > > >     34512201        1875744030          2012-07-09 19:40:39
> > > > >     34512201        597245693           2012-07-09 16:20:21
> > > > >     34512201        8071787728          2012-07-09 15:19:59
> > > > >     34512201        959188449 2012-07-09 08:48:47
> > > > >     34512201        5868222883 (tel:5868222883)          2012-07-09
08:23:17
> > > > >     34512201        5656232360          2012-07-09 07:28:47  
> > > > >     34512201        5639158173 (tel:5639158173)          2012-07-09
06:58:47
> > > > >     34512201        4645350592          2012-07-09 06:38:47
> > > > >     34512201        2422054205          2012-07-09 06:09:56
> > > > >     1015826235      210002448035        2012-07-09 22:21:11
> > > > >     1015826235      260003553382        2012-07-09 19:40:39
> > > > >     1015826235      220003038067        2012-07-09 19:40:21
> > > > >     1015826235      300003861266        2012-07-09 18:19:59
> > > > >     1015826235      140002997245        2012-07-09 09:23:17
> > > > >     1015826235      260003553385        2012-07-09 08:48:47
> > > > >     1015826235      260003553384        2012-07-09 07:28:47
> > > > >     1015826235      260003553381        2012-07-09 07:09:56
> > > > >     1015826235      260003553383        2012-07-09 06:58:47
> > > > >     1015826235      260003553389        2012-07-09 06:54:37
> > > > >  
> > > > >  
> > > > > Raihan Jamal
> > > > >  
> > > > >  
> > > > >  
> > > > > On Mon, Jul 9, 2012 at 7:56 PM, Andes <ylyy-1985@163.com (mailto:ylyy-1985@163.com)>
wrote:
> > > > > > hello, you can use "desc" and "limit 10" to filter the top 10.
 
> > > > > >   
> > > > > > 2012-07-10
> > > > > >  
> > > > > > Best Regards
> > > > > > Andes
> > > > > >   
> > > > > >  
> > > > > >  
> > > > > > 发件人:Raihan Jamal
> > > > > > 发送时间:2012-07-10 10 (tel:2012-07-10%C2%A010):31
> > > > > > 主题:Find TOP 10 using HiveQL
> > > > > > 收件人:"user"<user@hive.apache.org (mailto:user@hive.apache.org)>
> > > > > > 抄送:
> > > > > >   
> > > > > > When I run this query,
> > > > > >  
> > > > > > SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time
DESC;
> > > > > >  
> > > > > >  
> > > > > > I always get error as-  
> > > > > >  
> > > > > > FAILED: Parse Error: line 1:7 cannot recognize input 'TOP' in
select expression  
> > > > > >  
> > > > > > Is there any way around to use TOP 10 or something similar that
will work in HiveQL?
> > > > > >  
> > > > > >  
> > > > > > Raihan Jamal
> > > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > >  
> >  
> >  
> >  
> > --  
> > Nitin Pawar
> >  
>  


Mime
View raw message