hbase-dev mailing list archives

From "Abinash Karana (Bizosys)" <abin...@bizosys.com>
Subject RE: evaluating HBase
Date Mon, 17 Jan 2011 12:42:06 GMT
Hi Eric,
The duplicate record problem is addressed in Nutch by computing a signature
for each document.

This signature lets them determine whether the information is duplicated or
not. Your design is also good.

However, there is one possible issue with this: after you read an index key
and its list of matching entity keys, e.g. [ a, b, c, d, e ], you may need to
do a random read per key to fetch the details. This will be a slow process.
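The Nutch-style signature idea can be sketched in plain Java. The MD5 choice,
the normalization rules, and the class/method names here are illustrative
assumptions, not Nutch's actual implementation:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DupSignature {
    // Compute a dedup signature: hash the normalized values of the
    // fields that define a duplicate. Two records agreeing on those
    // fields (modulo case/whitespace) get the same signature.
    static String signature(String... dupFields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : dupFields) {
                // Normalize so trivial differences do not break matching.
                md5.update(field.trim().toLowerCase()
                                .getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator: "ab"+"c" != "a"+"bc"
            }
            return String.format("%032x", new BigInteger(1, md5.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Two records with the same dup-defining fields share a signature.
        String s1 = signature("Acme Corp", "Invoice");
        String s2 = signature(" acme corp ", "INVOICE");
        System.out.println(s1.equals(s2)); // prints true
    }
}
```

The fixed-width signature could then serve as the index-table row key instead
of the raw concatenated attribute values.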

MySQL (sharded) vs HBase - please share your findings.

Abinash Karan

-----Original Message-----
From: eric_bdr@yahoo.com [mailto:eric_bdr@yahoo.com] 
Sent: Monday, January 17, 2011 5:52 PM
To: dev@hbase.apache.org
Subject: evaluating HBase


I am currently evaluating HBase for an implementation of an ERP-like cloud 
solution that is supposed to handle 500M lines per year for the biggest
tenants and 10-20M for the smaller ones.  I am writing a couple of prototypes,
one with MySQL (sharded) and one with HBase - I will let you know what I find
if you are interested.  Anyway, I have 2 questions:

The first one is regarding the following post; I would like to get a 
perspective from the NoSQL camp on this one.

The second is regarding how to best implement a 'duplicate check'.
Here is what I have done so far: I have a single entity table, and I have 
created an index table whose key is the concatenated value of the 4 
attributes of the entity (these 4 attributes define what constitutes a 
duplicate record, while the entity can have around 100-150 different 
attributes).  In this index table, I have a column in which I store a 
comma-delimited list of all the keys that correspond to entities sharing
the same set of 4 attribute values.

For example (assuming that a dup is defined by entities having the same
values of a and b):

Entity table:
key, a, b, c, d, e
1, 1, 1, 1, 1, 1
2, 1, 1, 2, 2, 2
3, 1, 2, 2, 2, 2
4, 2, 2, 2, 2, 2

Index table:
key, value
11, [1, 2]
12, [3]
22, [4]
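Using the example rows above, the index-table construction can be sketched in
plain Java. The in-memory Map stands in for the HBase index table, and the
class/method names are hypothetical, just to illustrate the mapping:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DupIndex {
    // Build the index: concatenated dup-attribute values -> entity keys.
    // dupAttrs lists which columns define a duplicate; row[0] is the key.
    static Map<String, List<String>> buildIndex(String[][] rows,
                                                int[] dupAttrs) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String[] row : rows) {
            StringBuilder dupKey = new StringBuilder();
            for (int attr : dupAttrs) {
                dupKey.append(row[attr]);
            }
            index.computeIfAbsent(dupKey.toString(), k -> new ArrayList<>())
                 .add(row[0]);
        }
        return index;
    }

    public static void main(String[] args) {
        // key, a, b, c, d, e - the four rows from the example
        String[][] rows = {
            {"1", "1", "1", "1", "1", "1"},
            {"2", "1", "1", "2", "2", "2"},
            {"3", "1", "2", "2", "2", "2"},
            {"4", "2", "2", "2", "2", "2"},
        };
        // Dup is defined by attributes a and b (columns 1 and 2).
        Map<String, List<String>> index = buildIndex(rows, new int[]{1, 2});
        System.out.println(index); // prints {11=[1, 2], 12=[3], 22=[4]}
    }
}
```

This reproduces the index table shown above: any index row whose list holds
more than one entity key marks a group of duplicates.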

When I scan through my entity table, I plan on looking up the index table by
dup key and adding the current entity key to it.  I am worried about doing
this lookup per entity record for performance reasons.  To make things more
complicated, users should be able to change the set of attributes that define
a dup; I handle that by recreating my index table.
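The per-record lookup-and-append step, and the rebuild when the dup-defining
attribute set changes, might look like the following sketch. Again the Map is
an in-memory stand-in (with HBase this would be roughly one Get and one Put
against the index table per scanned row, which is the cost being worried
about), and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DupCheck {
    private final Map<String, List<String>> index = new HashMap<>();
    private int[] dupAttrs; // which columns define a duplicate

    DupCheck(int[] dupAttrs) {
        this.dupAttrs = dupAttrs;
    }

    // Look up the dup key for a row, report entities already seen with
    // the same dup key, then register this entity key in the index.
    List<String> checkAndAdd(String[] row) {
        StringBuilder dupKey = new StringBuilder();
        for (int attr : dupAttrs) dupKey.append(row[attr]);
        List<String> keys = index.computeIfAbsent(dupKey.toString(),
                                                  k -> new ArrayList<>());
        List<String> existing = new ArrayList<>(keys); // dups seen so far
        keys.add(row[0]); // row[0] is the entity key
        return existing;
    }

    // If the set of dup-defining attributes changes, rebuild from
    // scratch, as described above: drop the index and rescan everything.
    void redefine(int[] newDupAttrs, String[][] allRows) {
        this.dupAttrs = newDupAttrs;
        index.clear();
        for (String[] row : allRows) checkAndAdd(row);
    }

    public static void main(String[] args) {
        DupCheck dc = new DupCheck(new int[]{1, 2}); // dup = same a and b
        String[] r1 = {"1", "1", "1", "1", "1", "1"};
        String[] r2 = {"2", "1", "1", "2", "2", "2"};
        System.out.println(dc.checkAndAdd(r1)); // prints []
        System.out.println(dc.checkAndAdd(r2)); // prints [1]
    }
}
```

A non-empty return from checkAndAdd flags a duplicate at insert time, so a
separate full scan is only needed after redefine changes the attribute set.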

Is there a better way to write a dup check?

Thanks a lot for your help,
