hive-user mailing list archives

From Himanish Kushary <himan...@gmail.com>
Subject Identifying and Marking records as duplicates
Date Fri, 17 Aug 2012 14:49:45 GMT
Hi,

We have a huge table that may contain duplicate records. A record is
considered a duplicate based on four fields (fld1 through fld4). We need to
identify the duplicate records and possibly mark them, leaving the first
record in each group (by created time) unmarked.
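
To make the rule concrete, here is a minimal Python sketch of the marking
logic only. It is not a Hive solution, and the names created_time,
is_duplicate, and mark_duplicates are hypothetical, chosen just for
illustration. It assumes each record is a dict and that created_time values
can be compared.

```python
from itertools import groupby
from operator import itemgetter

# The four fields that define a duplicate (names taken from the description above).
KEY_FIELDS = ("fld1", "fld2", "fld3", "fld4")


def mark_duplicates(records):
    """Return copies of the records with a hypothetical is_duplicate flag.

    Within each group of records sharing the same four key fields, the
    record with the earliest created_time is left unmarked (False) and
    every later record is marked as a duplicate (True).
    """
    key = itemgetter(*KEY_FIELDS)
    # Sort by key first so groupby sees each group contiguously,
    # then by created_time so the earliest record comes first.
    ordered = sorted(records, key=lambda r: (key(r), r["created_time"]))
    marked = []
    for _, group in groupby(ordered, key=key):
        for position, record in enumerate(group):
            marked.append({**record, "is_duplicate": position > 0})
    return marked
```

Since nothing is updated in place, the sketch returns a new list rather than
modifying the input. If two records in a group share the same created time,
this sketch keeps whichever one the sort puts first.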

Is this something that can be done in Hive, or do we need to write a custom
M/R job for it? Could an inner join or a SELECT with GROUP BY be used to find
the duplicates? How do I mark the duplicate records, given that there is no
UPDATE?

What's the best way to do this using Hive? Looking forward to hearing your
suggestions.

Thanks
