hive-user mailing list archives

From Bob Gause
Subject Re: Identifying and Marking records as duplicates
Date Fri, 17 Aug 2012 15:46:58 GMT
We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some of our queries.
You could do that, and then do another select where the row number/rank is 1 to get all the
"unique" rows.
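
As a rough illustration of that ranking idea, here is a minimal Python sketch of the logic only. It is not the Hive query itself and does not use UDFNumberRows. The field names fld1 through fld4 and the created time come from the question below; the id column, the sample values, and the function name mark_duplicates are made up for illustration. In Hive the marked result would presumably be written to a new table, since (as the question notes) there is no update.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; values are invented for illustration.
rows = [
    {"id": 1, "fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created": "2012-08-01 10:00:00"},
    {"id": 2, "fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created": "2012-08-01 09:00:00"},
    {"id": 3, "fld1": "x", "fld2": "b", "fld3": "c", "fld4": "d", "created": "2012-08-01 11:00:00"},
    {"id": 4, "fld1": "a", "fld2": "b", "fld3": "c", "fld4": "d", "created": "2012-08-01 12:00:00"},
]

# The four fields that define a duplicate.
KEY = itemgetter("fld1", "fld2", "fld3", "fld4")


def mark_duplicates(rows):
    """Return copies of the rows with a per-group rank and a duplicate flag."""
    # Sort by the duplicate key, then by created time, so the earliest
    # record in each group comes first (timestamps here sort as strings).
    ordered = sorted(rows, key=lambda r: (KEY(r), r["created"]))
    marked = []
    for _, group in groupby(ordered, key=KEY):
        # Number the rows within each group, starting at 1.
        for rank, row in enumerate(group, start=1):
            marked.append({**row, "rank": rank, "is_duplicate": rank > 1})
    return marked


marked = mark_duplicates(rows)

# The second select: keep only rank 1 to get the "unique" rows.
unique_rows = [r for r in marked if r["rank"] == 1]
```

Rows with rank 1 are the earliest record for each combination of the four fields; everything with a higher rank is flagged as a duplicate rather than dropped, so either view can be selected afterwards.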

There are probably a bunch of other ways to do this, but this is the one that first came to
mind for me.


Robert Gause
Senior Systems Engineer
ZyQuest, Inc.

On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:

> Hi,
> We have a huge table which may have duplicate records. A record is considered a duplicate
> based on 4 fields (fld1 through fld4). We need to identify the duplicate records and possibly
> mark the duplicates (except the first record, based on created time).
> Is this something that could be done in Hive, or do we need to write custom M/R for this? Could
> an inner join or a select with group by be used to find the duplicates? How do I mark the
> duplicate records, as there is no update?
> What's the best way to do this using Hive? Looking forward to hearing your suggestions.
> Thanks
