hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raj Hadoop <hadoop...@yahoo.com>
Subject Re: Remove duplicate records in Hive
Date Wed, 10 Sep 2014 18:48:55 GMT
The

SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date

is returning the lowest date. I need the largest date.



--------------------------------------------
On Wed, 9/10/14, Raj Hadoop <hadoopraj@yahoo.com> wrote:

 Subject: Re: Remove duplicate records in Hive
 To: user@hive.apache.org
 Date: Wednesday, September 10, 2014, 2:41 PM
 
 Thanks. I will try it.
 --------------------------------------------
 On Wed, 9/10/14, Nishant Kelkar <nishant.k02@gmail.com>
 wrote:
 
  Subject: Re: Remove
 duplicate records in Hive
  To: user@hive.apache.org,
 hadoopraj@yahoo.com
  Date: Wednesday, September 10, 2014, 1:59
 PM
  
  Hi
 
 Raj, 
  You can do something
  along these lines: 
  
  SELECT
  cno, sqno,
 SORT_ARRAY(COLLECT_SET(date))[0] AS latest_date
  FROM table GROUP BY cno, sqno;
  However, you have to make sure your
  date format is such that sorting it gives you
 the most
  recent date. The best way to do
 that is to have it in
  format:
 YYYY-MM-DD.
  Hope this helps.
  Best Regards,Nishant
 
 Kelkar
  On Wed, Sep 10, 2014 at
  10:04 AM, Raj Hadoop <hadoopraj@yahoo.com>
  wrote:
  
  
  Hi,
  
  
  
  I have a requirement in Hive
 to remove duplicate records (
  they differ
 only by one column i.e a date column) and keep
  the latest date record.
  
  
  
  Sample
 :
  
  Hive Table :
  
   d2 is a higher
  
  cno,sqno,date
  
  
  
  100 1 1-oct-2013
  
  101 2 1-oct-2013
  
  100 1 2-oct-2013
  
  102 2 2-oct-2013
  
  
  
  
  
  Output needed:
  
  
  
  100 1 2-oct-2013
  
  101 2 1-oct-2013
  
  102 2 2-oct-2013
  
  
  
  I am using
 Hive 0.11
  
  
  
  Any suggestions please ?
  
  
  
  Regards,
  
 
 Raj
  
  
 

Mime
View raw message