hive-user mailing list archives

From Bob Gause <bob.ga...@zyquest.com>
Subject Re: Identifying and Marking records as duplicates
Date Fri, 17 Aug 2012 15:46:58 GMT
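We use com.facebook.hive.udf.UDFNumberRows to rank rows by time in some of our queries.
You could do the same here: number the rows within each group of matching fields, ordered
by created time, and then do another select where the row number/rank is 1 to get all the
"unique" rows. Any row with a rank greater than 1 is a duplicate.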
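
For illustration, here is a minimal sketch of the rank-then-filter idea. It is not
UDFNumberRows itself: it swaps in the standard SQL ROW_NUMBER() window function, run through
Python's built-in sqlite3 module (this needs SQLite 3.25 or later) so that it can be tried
without a cluster. The table name "records" and the columns "id" and "created" are
assumptions made up for the example; fld1 through fld4 come from the question. Hive releases
from 0.11 onward provide a ROW_NUMBER() windowing function with similar syntax.

```python
import sqlite3

# Hypothetical schema: fld1..fld4 form the duplicate key, "created" orders the copies.
# ROW_NUMBER() restarts at 1 for every distinct (fld1, fld2, fld3, fld4) combination.
QUERY = """
SELECT id, fld1, fld2, fld3, fld4, created,
       ROW_NUMBER() OVER (
           PARTITION BY fld1, fld2, fld3, fld4
           ORDER BY created, id
       ) AS rn
FROM records
"""


def rank_records(rows):
    """Return (id, rn) pairs sorted by id; rn == 1 is the earliest row for its key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER, fld1 TEXT, fld2 TEXT, fld3 TEXT, fld4 TEXT, created TEXT)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)", rows)
    ranked = conn.execute(
        "SELECT id, rn FROM (" + QUERY + ") ORDER BY id"
    ).fetchall()
    conn.close()
    return ranked


# Made-up sample data: ids 1, 2 and 4 share the same four key fields.
SAMPLE = [
    (1, "a", "b", "c", "d", "2012-08-01 10:00:00"),
    (2, "a", "b", "c", "d", "2012-08-02 09:30:00"),
    (3, "x", "y", "z", "w", "2012-08-01 12:00:00"),
    (4, "a", "b", "c", "d", "2012-07-31 08:00:00"),
]

ranked = rank_records(SAMPLE)
unique_ids = [rid for rid, rn in ranked if rn == 1]     # first record per key
duplicate_ids = [rid for rid, rn in ranked if rn > 1]   # later copies
print(unique_ids, duplicate_ids)
```

Since there is no update, one way to "mark" the duplicates would be to write the ranked
result, with a flag derived from rn > 1, out to a new table (for example with INSERT
OVERWRITE) instead of modifying the rows in place.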

There are probably a bunch of other ways to do this, but this is the one that first came to
mind for me.

Enjoy!
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.gause@zyquest.com

On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:

> Hi,
> 
> We have a huge table which may have duplicate records. A record is considered a duplicate
> based on 4 fields (fld1 through fld4). We need to identify the duplicate records and possibly
> mark the duplicates (except the first record, based on created time, for a record).
> 
> Is this something that could be done by Hive, or do we need to write custom M/R for this? Could
> an inner join or a select with group by be used to find the duplicates? How do I mark the
> duplicate records, as there is no update?
> 
> What's the best way to do this using Hive? Looking forward to hearing your suggestions.
> 
> Thanks

