hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: HBase Questions
Date Sun, 03 May 2015 15:43:39 GMT
For #1, 

You really don’t want to do what is suggested by the HBase book. 
Yes you can do it, but then again, just because you can do something doesn’t mean you should.
It’s really bad advice. 

HBase is IRT not CRUD.  
(IRT == Insert, Read, Tombstone) 

If there is a temporal component to your data, store them in different cells where time becomes
part of your column descriptor. 
Of the use cases so far, Splice Machine’s relational model seems to make the most of
versioning. They can control the depth and timeouts when they roll back transactions… this
is where tombstones come into play. (Although isolation levels and RDBMS RLL also come into
play.) [Note: RLL in HBase != RDBMS RLL]
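
To make the "time becomes part of your column descriptor" idea concrete, here is a minimal sketch in Python. It only builds the qualifier bytes and does not call any HBase API. The function name `timed_qualifier`, the `name:millis` layout, and the 13-digit zero padding are all assumptions made for illustration, not anything HBase prescribes.

```python
def timed_qualifier(name: str, epoch_millis: int) -> bytes:
    """Build a column qualifier that embeds the timestamp.

    Hypothetical layout: b"<name>:<13-digit zero-padded epoch millis>".
    Zero padding keeps byte-wise (lexicographic) order the same as
    chronological order, so the cells for one name sort oldest to newest.
    """
    if epoch_millis < 0:
        raise ValueError("epoch_millis must be non-negative")
    return f"{name}:{epoch_millis:013d}".encode("utf-8")


# Each reading gets its own cell instead of a new version of one cell.
readings = {
    timed_qualifier("price", 1430667819000): b"10.50",
    timed_qualifier("price", 1430667879000): b"10.75",
}
```

Each observation then lives in its own cell, so nothing depends on how many versions the column family is configured to keep.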

For #2,

Why use SHA1+document ID? 

While SHA1 may have collisions, I can’t recall ever seeing one, although it’s theoretically possible
with a large enough data set. 
SHA1 and SHA2 are slower than MD5.  

If you’re going to want to have a somewhat even distribution, you could use the MD5 hash
which is faster, truncate that and prepend it to the document ID. 
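
A rough sketch of that row key in Python, using only the standard library. The function name `make_row_key`, the 4-character hex prefix, and the `-` separator are assumptions chosen for illustration; pick the prefix length to suit your region count.

```python
import hashlib

PREFIX_HEX_CHARS = 4  # hypothetical choice: 4 hex chars = 65,536 buckets


def make_row_key(document_id: str, prefix_len: int = PREFIX_HEX_CHARS) -> bytes:
    """Return b"<truncated md5 hex>-<document id>".

    The hash is used only to spread keys across the key space, not for
    security, so MD5 is acceptable here.
    """
    digest = hashlib.md5(document_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{document_id}".encode("utf-8")


key = make_row_key("invoice-000123")
```

Because the full document ID is still in the key, two documents that happen to share a truncated prefix do not collide; the prefix only decides where the row lands.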

If the Document IDs are not being inserted in sequence, you shouldn’t have to worry about
hot spotting. 

If you use the Hash, you lose the ability to do range scans, therefore you have to know your
document ID in order to generate the hash and get your document. 
That’s your only access method besides a full table scan, or using secondary indexes. 
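
The access pattern described above can be sketched as follows. A plain Python dict stands in for the table, so this is not HBase client code; `row_key`, `put_document`, and `get_document` are hypothetical helpers, and the key layout (4 hex chars of MD5, a `-`, then the document ID) is an assumption.

```python
import hashlib


def row_key(document_id: str) -> bytes:
    # Hypothetical layout: first 4 hex chars of MD5, "-", then the document ID.
    prefix = hashlib.md5(document_id.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{document_id}".encode("utf-8")


table = {}  # in-memory stand-in for the HBase table


def put_document(document_id: str, body: bytes) -> None:
    table[row_key(document_id)] = body


def get_document(document_id: str):
    # A point lookup works only because the caller already knows the ID
    # and can recompute the same prefix.
    return table.get(row_key(document_id))


put_document("doc-42", b"hello")
found = get_document("doc-42")
```

Without the document ID there is no way to rebuild the prefix, which is why the remaining options are a full scan or a secondary index.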

> On May 3, 2015, at 9:37 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> For #1, see http://hbase.apache.org/book.html#versions and
> http://hbase.apache.org/book.html#schema.versions
> Cheers
> On Fri, May 1, 2015 at 9:17 PM, Arun Patel <arunp.bigdata@gmail.com> wrote:
>> 1) Are there any problems having many versions for a column family?  What's
>> the recommended limit?
>> 2) We have created a table for storing documents related data.  All
>> applications in our company are storing their documents data in same table
>> with rowkey as SHA1+Document ID.  Table is growing pretty rapidly.  I am
>> not seeing any issues as of now.  But, what kind of problems can be
>> expected with this approach in future?  First of all, Is this approach
>> correct?
>> Thanks,
>> Arun

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
