hbase-dev mailing list archives

From eric_...@yahoo.com
Subject evaluating HBase
Date Mon, 17 Jan 2011 12:22:05 GMT

I am currently evaluating HBase for an implementation of an ERP-like cloud 
solution that is supposed to handle 500M lines per year for the biggest tenant 
and 10-20M for the smaller tenants.  I am writing a couple of prototypes, one 
using MySQL (sharded) and one using HBase - I will let you know what I find if 
you are interested.  Anyway, I have two questions:

The first one is regarding the following post, and I would like to get a 
perspective from the NoSQL camp on it.

The second is regarding how to best implement a 'duplicate check' validation. 
 Here is what I have done so far: I have a single entity table, and I have 
created an index table whose key is the concatenated value of the 4 attributes 
of the entity (these 4 attributes define what constitutes a duplicate record, 
while the entity can have around 100-150 different attributes).  In this index 
table, I have a column in which I store a comma-delimited list of all the keys 
that correspond to entities having the same set of 4 attribute values.

For example (assuming a dup is defined as two entities having the same values 
for a and b):

Entity table:

key, a, b, c, d, e
1, 1, 1, 1, 1, 1
2, 1, 1, 2, 2, 2
3, 1, 2, 2, 2, 2
4, 2, 2, 2, 2, 2

Index table (dup key -> entity keys):

key, value
11, [1, 2]
12, [3]
22, [4]
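The index-building step for the example above could be sketched roughly as 
follows; this is only an in-memory illustration of the logic (a plain Map 
stands in for the HBase index table, and the class and method names are 
hypothetical). Note the use of a separator in the dup key, which avoids the 
ambiguity a bare concatenation has (e.g. "1"+"11" vs "11"+"1"):

```java
import java.util.*;

public class DupIndex {
    // Build the index-table row key from the dup-defining attribute values.
    // A separator keeps distinct value tuples from colliding.
    static String dupKey(String... attrs) {
        return String.join("|", attrs);
    }

    // Record an entity under its dup key. In HBase this would be a Put on
    // the index table; here a Map stands in for it.
    static void addToIndex(Map<String, List<String>> index,
                           String entityKey, String... dupAttrs) {
        index.computeIfAbsent(dupKey(dupAttrs), k -> new ArrayList<>())
             .add(entityKey);
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        // The four rows from the example, with (a, b) as the dup attributes.
        addToIndex(index, "1", "1", "1");
        addToIndex(index, "2", "1", "1");
        addToIndex(index, "3", "1", "2");
        addToIndex(index, "4", "2", "2");
        System.out.println(index.get("1|1")); // entities 1 and 2 collide
    }
}
```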

When I scan through my entity table, I plan to look up the index table by the 
dup key and add the current entity's key to it.  I am worried about the 
performance cost of doing this lookup once per entity record.  To make things 
more complicated, the set of attributes that defines a dup can change; I 
handle that by recreating my index table.
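The per-entity lookup-then-update step could be sketched like this; again a 
Map stands in for the index table (in HBase this would be a Get on the index 
row followed by a Put), and the names are hypothetical. Using a set rather 
than a comma-delimited list makes re-scanning the same entity idempotent:

```java
import java.util.*;

public class DupCheck {
    // Look up the index row for this entity's dup key: it is a duplicate
    // if some *other* entity is already recorded there. Then record this
    // entity, so a re-scan of the same row does not re-flag it.
    static boolean isDuplicate(Map<String, Set<String>> index,
                               String entityKey, String dupKey) {
        Set<String> peers =
            index.computeIfAbsent(dupKey, k -> new LinkedHashSet<>());
        boolean dup = !peers.isEmpty() && !peers.contains(entityKey);
        peers.add(entityKey);
        return dup;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        System.out.println(isDuplicate(index, "1", "1|1")); // first seen
        System.out.println(isDuplicate(index, "2", "1|1")); // collides with 1
        System.out.println(isDuplicate(index, "2", "1|1")); // re-scan, no flag
    }
}
```

Mentioning it as a thought rather than a recommendation: against a real HBase 
cluster the per-row round trip is the cost to watch, so batching the index 
reads for a block of scanned rows would cut the number of RPCs.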

Is there a better way to write a dup check?

Thanks a lot for your help,
