hadoop-common-user mailing list archives

From "C.V.Krishnakumar Iyer" <f2004...@gmail.com>
Subject Re: Deduplication Effort in Hadoop
Date Thu, 14 Jul 2011 19:11:19 GMT
Hi,

I guess by "system" you meant HDFS.

In that case HBase might help. HBase requires unique row keys. Row keys are just bytes, so you
can concatenate multiple columns into a single row key (if your primary key spans more than
one column); duplicate rows then cannot exist.

So, the data can be stored in HBase rather than in files, and everything else stays the same.
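To make the composite-key idea concrete, here is a minimal Python sketch (not actual HBase client code; the column names and records are made up) showing how concatenating several fields into one row key makes repeated writes collapse into a single row, the way repeated Puts to the same HBase row key would:

```python
# Hypothetical sketch: a plain dict stands in for an HBase table,
# where the last write to a given row key wins.

def composite_key(record, key_columns):
    """Concatenate the values of key_columns into a single row key.
    A separator avoids ambiguity such as ("ab","c") vs ("a","bc")."""
    return "|".join(str(record[c]) for c in key_columns)

def deduplicate(records, key_columns):
    """Keep only the last record seen for each composite key,
    mimicking repeated writes to the same row key."""
    table = {}
    for rec in records:
        table[composite_key(rec, key_columns)] = rec
    return list(table.values())

# Illustrative records; the first two share the same composite key.
records = [
    {"user": "a", "ts": 1, "event": "click", "v": 1},
    {"user": "a", "ts": 1, "event": "click", "v": 2},  # duplicate key
    {"user": "b", "ts": 1, "event": "click", "v": 3},
]
deduped = deduplicate(records, ["user", "ts", "event"])
```

In actual HBase you would also look at setting a time-to-live on the column family (e.g. 14 days) so that old rows expire, which lines up with the two-week duplicate window mentioned in the quoted question.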

I don't know about Hive, though.

Thanks,
Krishnakumar.



On Jul 14, 2011, at 9:18 AM, Michael Segel wrote:

> You don't have dupes because the key has to be unique.
> 
> 
> 
> Sent from my Palm Pre on AT&T
> On Jul 14, 2011 11:00 AM, jonathan.hwang@accenture.com <jonathan.hwang@accenture.com> wrote:
> 
> 
> Hi All,
> 
> In databases you can define primary keys to ensure that no duplicate data gets loaded
> into the system. Say I have on the order of 1 billion records flowing into my system every
> day, and some of them are repeats (identical records). I can use 2-3 columns in each record
> to match and look for duplicates. What is the best strategy for de-duplication? Duplicate
> records should only appear within the last 2 weeks. I want a fast way to get the data into
> the system without much delay. Can HBase or Hive help in any way?
> 
> 
> 
> Thanks!
> 
> Jonathan
> 
> 
> 
> ________________________________
> 
> This message is for the designated recipient only and may contain privileged, proprietary,
> or otherwise private information. If you have received it in error, please notify the sender
> immediately and delete the original. Any other use of the email by you is prohibited.
> 
> 
> 
> 

