Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2E63E11FB4 for ; Wed, 10 Sep 2014 19:38:26 +0000 (UTC) Received: (qmail 47504 invoked by uid 500); 10 Sep 2014 19:38:24 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 47442 invoked by uid 500); 10 Sep 2014 19:38:24 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 47432 invoked by uid 99); 10 Sep 2014 19:38:24 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Sep 2014 19:38:24 +0000 X-ASF-Spam-Status: No, hits=2.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of nishant.k02@gmail.com designates 209.85.220.180 as permitted sender) Received: from [209.85.220.180] (HELO mail-vc0-f180.google.com) (209.85.220.180) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Sep 2014 19:38:18 +0000 Received: by mail-vc0-f180.google.com with SMTP id hq11so957975vcb.25 for ; Wed, 10 Sep 2014 12:37:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=cym2tsXgVAJ/PLxgqQ5Gea3GS6dJPvTBLo1hD2X6SGo=; b=GWzSlI87cZjjkzzGN/Y4DVTNt7d2tZQPwKb5k57mRmgPkTANu1Slzmzz3K9sC0APzp WUqWOtQRJA1mRkll057v+bjyWwAN3byGrtBDuMyht2WsuIXSbJwfFsw73+LWwxp+hIoV iL4Z1VKG9z8TRoAEBxF+8PBDBIvc1xFTVJd83DW2Vo83UQ9kkYNbUpvfa/qM8gYYLanH /iUk9qV3eJv4lNwa71UXuBX8HkmLoIMiF9jdgPR8udIA37phrLGffjgE+Qymth4BUsdX C2LUxjBfdXQOERjnIVqtlTck0Xz5H+ZawHBoQkAMiw86GT3HqoNKg7G5FLkQYg3YcdGa TaZQ== MIME-Version: 1.0 X-Received: by 10.220.172.8 with SMTP id j8mr22247500vcz.32.1410377877688; Wed, 10 Sep 2014 12:37:57 -0700 (PDT) Received: by 10.52.233.39 with HTTP; Wed, 10 Sep 2014 12:37:57 -0700 (PDT) In-Reply-To: <1410374935.9863.YahooMailBasic@web120703.mail.ne1.yahoo.com> References: <1410374500.34068.YahooMailBasic@web120704.mail.ne1.yahoo.com> <1410374935.9863.YahooMailBasic@web120703.mail.ne1.yahoo.com> Date: Wed, 10 Sep 2014 12:37:57 -0700 Message-ID: Subject: Re: Remove duplicate records in Hive From: Nishant Kelkar To: user@hive.apache.org, hadoopraj@yahoo.com Content-Type: multipart/alternative; boundary=001a11c3678c6616560502bb30bb X-Virus-Checked: Checked by ClamAV on apache.org --001a11c3678c6616560502bb30bb Content-Type: text/plain; charset=UTF-8 Hi Raj, You'll have to change the format of your date to something like YYYY-MM-DD. For example, for "2-oct-2013" it will be 2013-10-02. Best Regards, Nishant Kelkar On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoop wrote: > The > > SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date > > is returning the lowest date. I need the largest date. > > > > -------------------------------------------- > On Wed, 9/10/14, Raj Hadoop wrote: > > Subject: Re: Remove duplicate records in Hive > To: user@hive.apache.org > Date: Wednesday, September 10, 2014, 2:41 PM > > Thanks. I will try it. > -------------------------------------------- > On Wed, 9/10/14, Nishant Kelkar > wrote: > > Subject: Re: Remove > duplicate records in Hive > To: user@hive.apache.org, > hadoopraj@yahoo.com > Date: Wednesday, September 10, 2014, 1:59 > PM > > Hi > > Raj, > You can do something > along these lines: > > SELECT > cno, sqno, > SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date > FROM table GROUP BY cno, sqno; > However, you have to make sure your > date format is such that sorting it gives you > the most > recent date. The best way to do > that is to have it in > format: > YYYY-MM-DD. > Hope this helps. > Best Regards,Nishant > > Kelkar > On Wed, Sep 10, 2014 at > 10:04 AM, Raj Hadoop > wrote: > > > Hi, > > > > I have a requirement in Hive > to remove duplicate records ( > they differ > only by one column i.e a date column) and keep > the latest date record. > > > > Sample > : > > Hive Table : > > d2 is a higher > > cno,sqno,date > > > > 100 1 1-oct-2013 > > 101 2 1-oct-2013 > > 100 1 2-oct-2013 > > 102 2 2-oct-2013 > > > > > > Output needed: > > > > 100 1 2-oct-2013 > > 101 2 1-oct-2013 > > 102 2 2-oct-2013 > > > > I am using > Hive 0.11 > > > > Any suggestions please ? > > > > Regards, > > > Raj > > > > --001a11c3678c6616560502bb30bb Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Raj,

You'll have to change the f= ormat of your date to something like YYYY-MM-DD. For example, for "2-o= ct-2013" it will be 2013-10-02.

Best Regards,=
Nishant Kelkar



On Wed, Sep 10, 2014 at 11:48 AM, Raj Hadoo= p <hadoopraj@yahoo.com> wrote:
The

SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date

is returning the lowest date. I need the largest date.



--------------------------------------------
On Wed, 9/10/14, Raj Hadoop <hadoopraj@yahoo.com> wrote:

=C2=A0Subject: Re: Remove duplicate records in Hive
=C2=A0To: user@hive.apache.org<= br>
=C2=A0Date: Wednesday, September 10, 2014, 2:41 PM

=C2=A0Thanks. I will try it.
=C2=A0--------------------------------------------
=C2=A0On Wed, 9/10/14, Nishant Kelkar <nishant.k02@gmail.com>
=C2=A0wrote:

=C2=A0 Subject: Re: Remove
=C2=A0duplicate records in Hive
=C2=A0 To: user@hive.apache.org= ,
=C2=A0hadoopraj@yahoo.com
=C2=A0 Date: Wednesday, September 10, 2014, 1:59
=C2=A0PM

=C2=A0 Hi

=C2=A0Raj,=C2=A0
=C2=A0 You can do something
=C2=A0 along these lines:=C2=A0

=C2=A0 SELECT
=C2=A0 cno, sqno,
=C2=A0SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
=C2=A0 FROM table GROUP BY cno, sqno;
=C2=A0 However, you have to make sure your
=C2=A0 date format is such that sorting it gives you
=C2=A0the most
=C2=A0 recent date. The best way to do
=C2=A0that is to have it in
=C2=A0 format:
=C2=A0YYYY-MM-DD.
=C2=A0 Hope this helps.
=C2=A0 Best Regards,Nishant

=C2=A0Kelkar
=C2=A0 On Wed, Sep 10, 2014 at
=C2=A0 10:04 AM, Raj Hadoop <hado= opraj@yahoo.com>
=C2=A0 wrote:


=C2=A0 Hi,



=C2=A0 I have a requirement in Hive
=C2=A0to remove duplicate records (
=C2=A0 they differ
=C2=A0only by one column i.e a date column) and keep
=C2=A0 the latest date record.



=C2=A0 Sample
=C2=A0:

=C2=A0 Hive Table :

=C2=A0 =C2=A0d2 is a higher

=C2=A0 cno,sqno,date



=C2=A0 100 1 1-oct-2013

=C2=A0 101 2 1-oct-2013

=C2=A0 100 1 2-oct-2013

=C2=A0 102 2 2-oct-2013





=C2=A0 Output needed:



=C2=A0 100 1 2-oct-2013

=C2=A0 101 2 1-oct-2013

=C2=A0 102 2 2-oct-2013



=C2=A0 I am using
=C2=A0Hive 0.11



=C2=A0 Any suggestions please ?



=C2=A0 Regards,


=C2=A0Raj




--001a11c3678c6616560502bb30bb--