hive-user mailing list archives

From Himanish Kushary <himan...@gmail.com>
Subject Re: Identifying and Marking records as duplicates
Date Fri, 17 Aug 2012 16:05:38 GMT
We want to separate them out into different tables for auditing purposes,
so I was thinking of marking them and then separating them out based on
the "duplicate" flag.

Is there a better way of splitting them into 2 tables? Also, how would I
pick the first record from each group and put it into a different table
once I have run the cross join on fld1 thru fld4 with order by timestamp?

- Himanish

On Fri, Aug 17, 2012 at 11:54 AM, Bertrand Dechoux <dechouxb@gmail.com> wrote:

> If I understand correctly:
> A cross join on fld1 thru fld4, ordered by timestamp and taking the first
> of each group, would give you all the unique and original copies.
>
> But indeed, if you don't have something like HBase, you cannot really
> mark them, but you can select them and write them elsewhere.
> Why do you specifically want to mark them?
>
> Bertrand
>
>
> On Fri, Aug 17, 2012 at 5:46 PM, Bob Gause <bob.gause@zyquest.com> wrote:
>
>> We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in
>> some of our queries. You could do that, and then do another select where
>> the row number/rank is 1 to get all the "unique" rows.
>>
>> There are probably a bunch of other ways to do this, but this is the one
>> that first came to mind for me...
>>
>> Enjoy!
>> Bob
>>
>> Robert Gause
>> Senior Systems Engineer
>> ZyQuest, Inc.
>> bob.gause@zyquest.com
>>
>> On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:
>>
>> > Hi,
>> >
>> > We have a huge table which may have duplicate records. A record is
>> > considered a duplicate based on 4 fields (fld1 thru fld4). We need to
>> > identify the duplicate records and possibly mark the duplicates (except
>> > the first record, based on the created time of a record).
>> >
>> > Is this something that could be done by Hive, or do we need to write
>> > custom M/R for this? Could an inner join or a select with group by be
>> > used to find the duplicates? How do I mark the duplicate records, as
>> > there is no update?
>> >
>> > What's the best way to do this using Hive? Looking forward to hearing
>> > your suggestions.
>> >
>> > Thanks
>>
>>
>
>
> --
> Bertrand Dechoux
>



-- 
Thanks & Regards
Himanish
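
One way the split discussed in this thread might look, as a minimal sketch only: find the earliest created time per (fld1 thru fld4) group with a group by, join it back to the table, and select the originals and the later copies separately so each result can be written to its own table. Nothing below comes from the posters. The table name `records`, the columns `id` and `created_ts`, and the sample rows are all hypothetical, and Python's sqlite3 module stands in for Hive so that the SQL can be run locally. In Hive, each SELECT would presumably feed an INSERT OVERWRITE TABLE statement instead.

```python
import sqlite3

# Hypothetical schema and data; sqlite3 is only a local stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER, fld1 TEXT, fld2 TEXT, fld3 TEXT, fld4 TEXT, created_ts INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d", 100),  # earliest copy of key (a, b, c, d)
        (2, "a", "b", "c", "d", 200),  # later copy of the same key
        (3, "a", "b", "c", "d", 300),  # later copy of the same key
        (4, "x", "y", "z", "w", 150),  # only copy of its key
    ],
)

# Earliest created time per (fld1..fld4) group, joined back to every row.
# The ON clause uses only equality conditions; the time comparison is
# applied afterwards in the WHERE clause of each query.
JOINED = """
FROM records t
JOIN (SELECT fld1, fld2, fld3, fld4, MIN(created_ts) AS first_ts
      FROM records
      GROUP BY fld1, fld2, fld3, fld4) f
  ON t.fld1 = f.fld1 AND t.fld2 = f.fld2
 AND t.fld3 = f.fld3 AND t.fld4 = f.fld4
"""

# Rows whose created time equals the group's earliest time are the originals.
originals = conn.execute(
    "SELECT t.id " + JOINED + " WHERE t.created_ts = f.first_ts ORDER BY t.id"
).fetchall()

# Rows created after the group's earliest time are the duplicates.
duplicates = conn.execute(
    "SELECT t.id " + JOINED + " WHERE t.created_ts > f.first_ts ORDER BY t.id"
).fetchall()

original_ids = [row[0] for row in originals]
duplicate_ids = [row[0] for row in duplicates]
print(original_ids)   # [1, 4]
print(duplicate_ids)  # [2, 3]
```

Note that if two records in a group share the earliest created time, this sketch treats both as originals, so a tie-breaking column would be needed to keep exactly one. Ranking rows within each group, as Bob suggests, and keeping rank 1 is an alternative that avoids the join.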
