hadoop-common-user mailing list archives

From <jonathan.hw...@accenture.com>
Subject Deduplication Effort in Hadoop
Date Thu, 14 Jul 2011 16:00:51 GMT
Hi All,
In a database you can define primary keys to ensure that no duplicate data gets loaded
into the system. Let's say I have 1 billion records flowing into my system every day,
and some of these are repeated data (the same records). I can use 2-3 columns in the record
to match and look for duplicates. What is the best strategy for de-duplication? Duplicate
records should only appear within the last 2 weeks. I want a fast way to get the data into
the system without much delay. Is there any way HBase or Hive can help?
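
To make the requirement concrete, here is a minimal sketch of the matching logic only: a record counts as a duplicate if the same key columns were already seen within the last 14 days. It is plain in-memory Python, not a Hadoop, HBase, or Hive implementation. The class name, the `key_columns` parameter, and the sample column names (`user_id`, `event_id`) are hypothetical and chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Assumption: "the last 2 weeks" is treated as a 14-day window.
WINDOW = timedelta(days=14)


class WindowedDeduplicator:
    """Flags records whose key columns were already seen within the window."""

    def __init__(self, key_columns, window=WINDOW):
        self.key_columns = key_columns  # the 2-3 columns used for matching
        self.window = window
        self.first_seen = {}  # key tuple -> timestamp of the accepted record

    def is_new(self, record, ts):
        """Return True if the record should be loaded, False if it is a duplicate."""
        key = tuple(record[c] for c in self.key_columns)
        previous = self.first_seen.get(key)
        if previous is not None and ts - previous <= self.window:
            return False  # same key seen inside the window: duplicate
        self.first_seen[key] = ts  # first sighting, or the old one has expired
        return True

    def evict(self, now):
        """Drop keys older than the window so the index does not grow forever."""
        cutoff = now - self.window
        self.first_seen = {k: t for k, t in self.first_seen.items() if t >= cutoff}


if __name__ == "__main__":
    dedup = WindowedDeduplicator(["user_id", "event_id"])
    start = datetime(2011, 7, 14)
    rec = {"user_id": 1, "event_id": "a", "payload": "x"}
    print(dedup.is_new(rec, start))                       # first sighting
    print(dedup.is_new(rec, start + timedelta(days=3)))   # repeat inside the window
    print(dedup.is_new(rec, start + timedelta(days=20)))  # repeat after the window
```

An in-memory dictionary would not hold a billion keys per day, so this only pins down the semantics. One possible direction, offered as an assumption rather than a recommendation, is to keep the same key-to-timestamp index in a keyed store: in HBase, for example, the matching columns could form the row key, and a 14-day TTL on the column family could take the place of `evict`.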


