Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B5A809AC9 for ; Tue, 10 Jul 2012 07:24:54 +0000 (UTC) Received: (qmail 18221 invoked by uid 500); 10 Jul 2012 07:24:50 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 18048 invoked by uid 500); 10 Jul 2012 07:24:50 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 18034 invoked by uid 99); 10 Jul 2012 07:24:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 10 Jul 2012 07:24:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of wouter@spotify.com designates 209.85.217.176 as permitted sender) Received: from [209.85.217.176] (HELO mail-lb0-f176.google.com) (209.85.217.176) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 10 Jul 2012 07:24:43 +0000 Received: by lboj14 with SMTP id j14so19912981lbo.35 for ; Tue, 10 Jul 2012 00:24:20 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:to:message-id:in-reply-to:references:subject:x-mailer :mime-version:content-type:x-gm-message-state; bh=Ni+CmYBE1zleAXjg2HJjIBYxQE7hkFqjypgYZ0nf6e8=; b=nu3+84XmThpNXb98B/libfHOcjULfmY6nyShTvLApRmgcyaXn7V2PfuC1BWLeR8apy QoMKway9c7Ybo3Em9KbxkpRqpguGFCA0s/T4ZaSxiLID6b7x7dDEQLTHl0z+/yceFDjH eoxoAzdfGNvF6PLCq6qde6V+MQWE9njQbqbDIqPrFk61vVkQfmrHb6NdDlWcx+hCNCI4 9wZ2RTzTd5e6ue8pcjcctfQhk2lfEUPox71q0gWFJT2MPj8TXdnF703MemvFG0UiAPD8 h0Zq/Pw+qRqBUWQwc6oC1hM15l+pxELm5D9mJ/o2QrA9jyZzMI/wfItKCcEdaXZYz4Vc Wmmg== Received: by 10.152.127.98 with SMTP id nf2mr7923224lab.4.1341905060547; Tue, 10 Jul 2012 00:24:20 -0700 (PDT) Received: from [194.236.60.42] (42.60.236.194.office.spotify.net. [194.236.60.42]) by mx.google.com with ESMTPS id fy10sm49543518lab.0.2012.07.10.00.24.18 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 10 Jul 2012 00:24:19 -0700 (PDT) Date: Tue, 10 Jul 2012 09:24:16 +0200 From: Wouter de Bie To: user@hive.apache.org Message-ID: <8466DAB3415C4BADBD934FCAD0CA4C49@spotify.com> In-Reply-To: References: <575040e1.34e0.1386ed10f39.Coremail.ylyy-1985@163.com> Subject: Re: Find TOP 10 using HiveQL X-Mailer: sparrow 1.6.1 (build 1081.52) MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="4ffbd8a0_257130a3_5cdd" X-Gm-Message-State: ALoCoQnRDAnUR4vhvzZ77nnIpCgr1xRzcfu0F7G62qnACup0M4xoj/eBcarYZMO0BCvVpcqyDBtS X-Virus-Checked: Checked by ClamAV on apache.org --4ffbd8a0_257130a3_5cdd Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline You could use TRANS=46ORM with a simple awk script: TRANS=46ORM(a, b, c, d) USING =22/usr/bin/awk ' =7Bif(=241=21=3Dc)=7Bc=3D=241; a=3D0=7D; if(a<20)=7Bprint =240; a++=7D=7D= '=22 This will create a top 20 for each group. =20 --Wouter de Bie Team Lead Analytics Infrastructure, Spotify wouter=40spotify.com (mailto:wouter=40spotify.com) +46 72 018 0777 On Tuesday, July 10, 2012 at 9:15 AM, Raihan Jamal wrote: > I am trying that solution. Currently I am running my query to see what = result I am getting back with UD=46. > =20 > =20 > =20 > =20 > Raihan Jamal =20 > =20 > =20 > On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar wrote: > > i thought you managed to solve this with rank=3F=3F > > =20 > > =20 > > On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal wrote: > > > Problem with that approach is, with LIMIT 10, If I am putting after= desc, then it will get only 10 rows irrespective of BUYER=5FID. But I ne= ed specifically for each BUYER=5FID 10 latest rows. > > > =20 > > > =20 > > > =20 > > > =20 > > > Raihan Jamal > > > =20 > > > =20 > > > =20 > > > On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari wrote: > > > > Raihan, > > > > =20 > > > > Andes suggests you use 'limit' after 'desc' ie Hive with sort you= r query results in descending order and then chop off any more than first= 10 records. =20 > > > > However, Hive will still run the full scan on data since it has t= o find the sorted list of records for you. =20 > > > > =20 > > > > =20 > > > > On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal wrote: > > > > > This is my below requirement. I need- =46ind =60TOP 10=60 data= for each =60BUYER=5FID, So I cannot use LIMIT 10 here in this case. > > > > > =20 > > > > > This is the below table > > > > > =20 > > > > > CREATE TABLE I=46 NOT EXISTS TestingTable1 =20 > > > > > ( =20 > > > > > BUYER=5FID BIGINT, > > > > > ITEM=5FID BIGINT, =20 > > > > > CREATED=5FTIME STRING > > > > > ) > > > > > =20 > > > > > And this is the below data in the above table- =20 > > > > > =20 > > > > > BUYER=5FID =7C ITEM=5FID =7C CREATED=5FTIME > > > > > ------------+------------------+----------------------- > > > > > 1015826235 220003038067 2012-07-09 19:40:21, > > > > > 1015826235 300003861266 2012-07-09 18:19:59, > > > > > 1015826235 140002997245 2012-07-09 09:23:17, > > > > > 1015826235 210002448035 2012-07-09 22:21:11, > > > > > 1015826235 260003553381 2012-07-09 07:09:56, > > > > > 1015826235 260003553382 2012-07-09 19:40:39, > > > > > 1015826235 260003553383 2012-07-09 06:58:47, > > > > > 1015826235 260003553384 2012-07-09 07:28:47, > > > > > 1015826235 260003553385 2012-07-09 08:48:47, > > > > > 1015826235 260003553386 2012-07-09 06:38:47, > > > > > 1015826235 260003553387 2012-07-09 05:38:47, > > > > > 1015826235 260003553388 2012-07-09 04:55:47, > > > > > 1015826235 260003553389 2012-07-09 06:54:37, > > > > > 34512201 597245693 2012-07-09 16:20:21, > > > > > 34512201 8071787728 2012-07-09 15:19:59, > > > > > 34512201 5868222883 (tel:5868222883) 2012-0= 7-09 08:23:17, > > > > > 34512201 2412180494 2012-07-09 22:21:11, > > > > > 34512201 2422054205 2012-07-09 06:09:56, > > > > > 34512201 1875744030 2012-07-09 19:40:39, > > > > > 34512201 5639158173 (tel:5639158173) 2012-0= 7-09 06:58:47, > > > > > 34512201 5656232360 2012-07-09 07:28:47, =20 > > > > > 34512201 959188449 2012-07-09 08:48:47, > > > > > 34512201 4645350592 2012-07-09 06:38:47, > > > > > 34512201 5657320532 2012-07-09 05:38:47, > > > > > 34512201 290419656539 2012-07-09 04:55:47, > > > > > =20 > > > > > If you see the above data in the table, there are only two UNIQ= UE =60BUYER=5FID=60 and corresponding to those I have =60ITEM=5FID=60 AND= =60CREATED=5FTIME=60. I need only 10 latest record basis on the time for= that given day for each =60BUYER=5FID=60. =20 > > > > > =20 > > > > > So for this =60BUYER=5FID=60 - =6034512201=60 I need 10 latest = record basis on =60CREATED=5FTIME=60 for that given day only, it means fo= r today's date I need 10 latest record for each =60BUYER=5FID=60. > > > > > =20 > > > > > And each =60BUYER=5FID=60 can have any day's data. But I am spe= cifically interested for day before today's data(means yesterday's date = always) by checking at the =60CREATED=5FTIME=60 > > > > > =20 > > > > > **=46ind =60TOP 10=60 data for each =60BUYER=5FID=60. Below is = the sample output.** =20 > > > > > =20 > > > > > Sample Output. > > > > > =20 > > > > > BUYER=5FID =7C ITEM=5FID =7C CREATED=5FTIME =20 > > > > > ------------+------------------+----------------------- > > > > > 34512201 2412180494 2012-07-09 22:21:11 > > > > > 34512201 1875744030 2012-07-09 19:40:39 > > > > > 34512201 597245693 2012-07-09 16:20:21 > > > > > 34512201 8071787728 2012-07-09 15:19:59 > > > > > 34512201 959188449 2012-07-09 08:48:47 > > > > > 34512201 5868222883 (tel:5868222883) 2012-0= 7-09 08:23:17 > > > > > 34512201 5656232360 2012-07-09 07:28:47 =20 > > > > > 34512201 5639158173 (tel:5639158173) 2012-0= 7-09 06:58:47 > > > > > 34512201 4645350592 2012-07-09 06:38:47 > > > > > 34512201 2422054205 2012-07-09 06:09:56 > > > > > 1015826235 210002448035 2012-07-09 22:21:11 > > > > > 1015826235 260003553382 2012-07-09 19:40:39 > > > > > 1015826235 220003038067 2012-07-09 19:40:21 > > > > > 1015826235 300003861266 2012-07-09 18:19:59 > > > > > 1015826235 140002997245 2012-07-09 09:23:17 > > > > > 1015826235 260003553385 2012-07-09 08:48:47 > > > > > 1015826235 260003553384 2012-07-09 07:28:47 > > > > > 1015826235 260003553381 2012-07-09 07:09:56 > > > > > 1015826235 260003553383 2012-07-09 06:58:47 > > > > > 1015826235 260003553389 2012-07-09 06:54:37 > > > > > =20 > > > > > =20 > > > > > Raihan Jamal > > > > > =20 > > > > > =20 > > > > > =20 > > > > > On Mon, Jul 9, 2012 at 7:56 PM, Andes wrote: > > > > > > hello, you can use =22desc=22 and =22limit 10=22 to filter th= e top 10. =20 > > > > > > =20 > > > > > > 2012-07-10 > > > > > > =20 > > > > > > Best Regards > > > > > > Andes > > > > > > =20 > > > > > > =20 > > > > > > =20 > > > > > > =E5=8F=91=E4=BB=B6=E4=BA=BA=EF=BC=9ARaihan Jamal > > > > > > =E5=8F=91=E9=80=81=E6=97=B6=E9=97=B4=EF=BC=9A2012-07-10 10 (t= el:2012-07-10%C2%A010):31 > > > > > > =E4=B8=BB=E9=A2=98=EF=BC=9A=46ind TOP 10 using HiveQL > > > > > > =E6=94=B6=E4=BB=B6=E4=BA=BA=EF=BC=9A=22user=22 > > > > > > =E6=8A=84=E9=80=81=EF=BC=9A > > > > > > =20 > > > > > > When I run this query, > > > > > > =20 > > > > > > SELECT TOP 10 =46ROM TestingTable1 WHERE ORDER BY buyer=5Fid,= created=5Ftime DESC; > > > > > > =20 > > > > > > =20 > > > > > > I always get error as- =20 > > > > > > =20 > > > > > > =46AILED: Parse Error: line 1:7 cannot recognize input 'TOP' = in select expression =20 > > > > > > =20 > > > > > > Is there any way around to use TOP 10 or something similar th= at will work in HiveQL=3F > > > > > > =20 > > > > > > =20 > > > > > > Raihan Jamal > > > > > =20 > > > > =20 > > > > =20 > > > > =20 > > > > =20 > > > > =20 > > > =20 > > =20 > > =20 > > =20 > > -- =20 > > Nitin Pawar > > =20 > =20 --4ffbd8a0_257130a3_5cdd Content-Type: text/html; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline
You could use TRANS=46ORM with a simple awk script:

TRANS=46ORM(a, b, c, d)
  USING =22/usr/bin/awk '
  =7Bif(=241=21=3Dc)=7Bc=3D=241; a=3D0=7D; if(a<20)=7Bpr= int =240; a++=7D=7D'=22

This will create a= top 20 for each group.

--
Wouter de Bie
Team Lead Analytics Infrastructure, Spotify
wouter=40spotify.com
+46 72 018 0777
=

=20

On Tuesday, July 10, 2= 012 at 9:15 AM, Raihan Jamal wrote:

I am trying that= solution. Currently I am running my query to see what result I am gettin= g back with UD=46.



=
Raihan Jamal



On Tue, Jul 10, 2012 at 12:13 AM, Nitin Pawar <nitinpawar432=40gmail.com> wrote:
i thought you managed to solve this with rank=3F=3F


On Tue, Jul 10, 2012 at 12:38 PM, Raihan Jamal <= ;j= amalraihan=40gmail.com> wrote:
Problem with that approach is, with LIMIT = 10, If I am putting after desc, then it will get only 10 rows irrespectiv= e of BUYER=5FID. But I need specifically for each BUYER=5FID 10 latest ro= ws.




Raihan Jamal



On Tue, Jul 10, 2012 at 12:03 AM, Abhishek Tiwari <abhishektiwari.btech=40gmail.com> wrot= e:
Raihan,

Andes= suggests you use 'limit' after 'desc' ie Hive with sort your query resul= ts in descending order and then chop off any more than first 10 records.<= /div>
However, Hive will still run the full scan on data since it has to f= ind the sorted list of records for you.  


On Tue, Jul 10, 2012 at 8:36 AM, Raihan Jamal <jamalraihan=40gmail.com> wrote:
This is my below requirement. I need-  = =46ind =60TOP 10=60 data for each =60BUYER=5FID, So I cannot use LIMI= T 10 here in this case.

This is the below table<= /div>

    CREATE TABLE I=46 NOT EXISTS TestingTable1 
=     ( 
    BUYER=5FID BIGINT,
    ITEM=5FID BIGINT, 
    CREATED=5F= TIME STRING
    )

And this i= s the below data in the above table-

    BUYER=5FID    =7C   &nbs= p;ITEM=5FID   =7C CREATED=5FTIME
&= nbsp;   ------------+------------------+-----------------------
    1015826235      220003038067   &nb= sp;    2012-07-09 19:40:21,
    1015826235 =      300003861266        2012-07-09 18= :19:59,
    1015826235      1400029972= 45        2012-07-09 09:23:17,
   = ; 1015826235      210002448035        = 2012-07-09 22:21:11,
    1015826235      260003553381   &nb= sp;    2012-07-09 07:09:56,
    1015826235 =      260003553382        2012-07-09 19= :40:39,
    1015826235      2600035533= 83        2012-07-09 06:58:47,
   = ; 1015826235      260003553384        = 2012-07-09 07:28:47,
    1015826235      260003553385   &nb= sp;    2012-07-09 08:48:47,
    1015826235 =      260003553386        2012-07-09 06= :38:47,
    1015826235      2600035533= 87        2012-07-09 05:38:47,
   = ; 1015826235      260003553388        = 2012-07-09 04:55:47,
    1015826235      260003553389   &nb= sp;    2012-07-09 06:54:37,
    34512201 &n= bsp;      597245693           201= 2-07-09 16:20:21,
    34512201       &= nbsp;8071787728          2012-07-09 15:19:59,
    34512201        586822= 2883          2012-07-09 08:23:17,
    34512201        2412180494   =        2012-07-09 22:21:11,
    3= 4512201        2422054205        =  2012-07-09 06:09:56,
    34512201    = ;    1875744030          2012-07-09 19= :40:39,
    34512201        5639158173          2012-07-09 06:58:47,=
    34512201        5656232360   =        2012-07-09 07:28:47, 
  &n= bsp; 34512201        959188449 2012-07-09 08:48:47,
    = 34512201        4645350592        = ;  2012-07-09 06:38:47,
    34512201        5657320532 2012-07-09 05:38:47,
    34512201        290419656539 2012-07-09 04:55:47,

If you see the above data in the table, there are onl= y two UNIQUE =60BUYER=5FID=60 and corresponding to those I have =60ITEM=5F= ID=60 AND =60CREATED=5FTIME=60. I need only 10 latest record basis on the= time for that given day for each =60BUYER=5FID=60.

So for this =60BUYER=5FID=60 - =6034512201=60 I need = 10 latest record basis on =60CREATED=5FTIME=60 for that given day only, i= t means for today's date I need 10 latest record for each =60BUYER=5FID=60= .

And each =60BUYER=5FID=60 can have any day's data. But I am specific= ally interested for day before  today's data(means yesterday's date = always) by checking at the =60CREATED=5FTIME=60

= **=46ind =60TOP 10=60 data for each =60BUYER=5FID=60. Below is the sample= output.**

Sample Output.

   = BUYER=5FID    =7C    ITEM=5FID   =7C CREATED=5FTIME
    ------------+------------------+----------------------= -
    34512201        2412180494 =          2012-07-09 22:21:11
  &n= bsp; 34512201        1875744030       =    2012-07-09 19:40:39
    34512201   =      597245693           2012-07-= 09 16:20:21
    34512201        8071787728   =        2012-07-09 15:19:59
    34= 512201        959188449 2012-07-09 08:48:47
    34512201=        5868222883       =    2012-07-09 08:23:17
    34512201        5656232360   =        2012-07-09 07:28:47 
  &nb= sp; 34512201        5639158173   &n= bsp;      2012-07-09 06:58:47
    3451= 2201        4645350592         &n= bsp;2012-07-09 06:38:47
    34512201        2422054205   =        2012-07-09 06:09:56
    1015826235      210002448035   &nb= sp;    2012-07-09 22:21:11
    1015826235 &= nbsp;    260003553382        2012-07-09 19:= 40:39
    1015826235      220003038067=        2012-07-09 19:40:21
    1= 015826235      300003861266        201= 2-07-09 18:19:59
    1015826235      140002997245   &nb= sp;    2012-07-09 09:23:17
    1015826235 &= nbsp;    260003553385        2012-07-09 08:= 48:47
    1015826235      260003553384=        2012-07-09 07:28:47
    1= 015826235      260003553381        201= 2-07-09 07:09:56
    1015826235      260003553383   &nb= sp;    2012-07-09 06:58:47
    1015826235 &= nbsp;    260003553389        2012-07-09 06:= 54:37


Raihan Jamal



On Mon, Jul 9, 2012 at 7:56 PM, Andes = <yl= yy-1985=40163.com> wrote:
=
hello, you =20 can use =22desc=22 and =22limit 10=22 to filter the top 10.
 
2012-07-10

Best Regards
Andes
 

=E5=8F=91=E4=BB=B6=E4=BA=BA=EF=BC= =9ARaihan Jamal
=E5=8F=91=E9=80=81=E6=97=B6=E9=97= =B4=EF=BC=9A2012-07-10 10:31
=E4=B8=BB=E9=A2=98=EF=BC=9A=46ind TOP 10 using =20 HiveQL
=E6=94=B6=E4=BB=B6=E4=BA=BA=EF=BC= =9A=22user=22<user=40hive.apache.org>
=E6=8A=84=E9=80=81=EF=BC=9A
 
When I run this query,

SELECT TOP 10 =46ROM TestingTable1 WHERE ORDER BY buyer=5F= id, =20 created=5Ftime DESC;


I always get error as-

=46AILED: Parse Error: line 1:7 cannot recognize = input 'TOP' in =20 select expression

Is there any way around to use TOP 10 or something similar that will= work =20 in HiveQL=3F



Raihan =20 Jamal











<= /div>--
Nitin Pawar


=20 =20 =20 =20 =20

--4ffbd8a0_257130a3_5cdd--