hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Suggested and max number of CFs per table
Date Thu, 17 Mar 2011 15:10:17 GMT

Otis, you sure are busy blogging. ;-)

Ok but to answer your question... you want as few column families as possible.

When we first started looking at HBase, we tried to view the column families as if they were
relational tables and the key was a foreign key joining the two tables.
(It's actually not a bad way for RDBMS data modelers to look at a column-oriented database
for the first time...)

The trouble is that a design that follows third normal form ends up reading from two or more
column families at the same time. This is where your problems begin: each column family is
stored in its own files, so every such read takes a performance hit.

With respect to your example... 
What are the data access patterns? Are they discrete between tenants?
As long as the data access is discrete between tenants and the tenants write to only one bucket,
you can do what you suggest.
But here's something to consider...
You are going to want to know your tenant's retention policy before you attempt to get the
data. This means you read from one column family when you do your get() and not all of them,
right? ;-)
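
To make that last point concrete, here is a minimal sketch of looking up a tenant's retention policy first and then building a read that names exactly one column family. It is plain Python with no HBase dependency, and every name in it (FAMILY_TTLS, TENANT_RETENTION_DAYS, family_for_tenant, build_get, the family names and tenant ids) is hypothetical and not taken from the thread. In the HBase Java client the same restriction would be expressed by calling Get.addFamily() with the chosen family before issuing the get.

```python
# Hypothetical sketch: these names are illustrative, not part of HBase.
DAY_SECONDS = 24 * 60 * 60

# Assumed layout: one column family per retention period, TTL in seconds.
FAMILY_TTLS = {
    "ttl_1m": 30 * DAY_SECONDS,
    "ttl_6m": 180 * DAY_SECONDS,
}

# Assumed tenant metadata, consulted before any read or write.
TENANT_RETENTION_DAYS = {
    "tenant_a": 30,
    "tenant_b": 180,
}


def family_for_tenant(tenant_id):
    """Return the single column family that holds this tenant's data."""
    retention = TENANT_RETENTION_DAYS[tenant_id] * DAY_SECONDS
    for family, ttl in FAMILY_TTLS.items():
        if ttl == retention:
            return family
    raise KeyError(f"no column family with a TTL matching tenant {tenant_id!r}")


def build_get(tenant_id, row_key):
    """Describe a read that touches exactly one column family."""
    return {"row": row_key, "families": [family_for_tenant(tenant_id)]}
```

With this shape, a read for tenant_a only ever names the 1-month family, so the other families' files are never opened for that request.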



> Date: Wed, 16 Mar 2011 23:30:14 -0700
> From: otis_gospodnetic@yahoo.com
> Subject: Suggested and max number of CFs per table
> To: user@hbase.apache.org
> Hi,
> My Q is around the suggested or maximum number of CFs per table (see 
> http://hbase.apache.org/book/schema.html#number.of.cfs )
> Consider the following use-case.
> * A multi-tenant system.
> * All tenants write data to the same table.
> * Tenants have different data retention policies.
> For the above use case I thought one could then just have different CFs with 
> different TTLs because Stack suggested relying on HBase's ability to purge old 
> rows by applying CF-specific TTLs: http://search-hadoop.com/m/VAeb52cvWHV.  
> These CFs would have the same set of columns, just different TTLs.  Then tenants 
> who want to keep only last 1 month's worth of data go to the CF where TTL=1 
> month, tenants who want to keep last 6 months of data go to CF where TTL=6 
> months, and so on.  However, tenants are not going to be evenly distributed - 
> there will be more tenants with shorter data retention periods, which means the 
> CFs where these tenants have their data will grow faster.
> If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs correctly, 
> the advice is not to have more than 2-3 CFs per table?
> And what happens if I have say 6 CFs per table?
> Again if I read the above page correctly, the problem is that uneven data 
> distribution will mean that whenever 1 of my CFs needs to be flushed, the 
> remaining 5 CFs will also get flushed at the same time, and this may (or will?) 
> trigger compaction for all CFs' files creating a sudden IO hit?
> Is there a good solution for this problem?
> Should one then have 6 different tables, each with just 1 CF instead of having 1 
> table with 6 CFs?
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/