hbase-user mailing list archives

From "Srikanth P. Shreenivas" <Srikanth_Shreeni...@mindtree.com>
Subject RE: Tall-Narrow vs. Flat-Wide Tables
Date Fri, 02 Sep 2011 12:32:19 GMT
Thanks Dave.

In that case, I guess the first chapter of HBase: The Definitive Guide
(http://ofps.oreilly.com/titles/9781449396107/intro.html) needs a correction where it states:
As opposed to the limit on column families there is no such thing for the number of columns:
you could have millions of columns in a particular column family. There is also no type nor
length boundary on the column values.

If the email schema below is an example of bad design, not because of its query/access
pattern but because of the problems it can cause for region splits, then the above excerpt
from the book should come with some fine print ;-)


-----Original Message-----
From: Buttler, David [mailto:buttler1@llnl.gov] 
Sent: Friday, September 02, 2011 2:08 AM
To: user@hbase.apache.org
Subject: RE: Tall-Narrow vs. Flat-Wide Tables

The "HBase: The Definitive Guide" answer seems pretty, um, definitive to me.  The only reason
I would even consider going against that advice is if I had solid knowledge that it was impossible
for a user to have more than 100,000 emails.  But even then it seems like a difficult design
decision to justify.  How does that design help you do something?


-----Original Message-----
From: Srikanth P. Shreenivas [mailto:Srikanth_Shreenivas@mindtree.com] 
Sent: Thursday, September 01, 2011 11:53 AM
To: user@hbase.apache.org
Subject: Tall-Narrow vs. Flat-Wide Tables


Chapter 9 of HBase: The Definitive Guide discusses tall-narrow vs. flat-wide tables.

It seems to propose that tall-narrow tables (more rows, fewer columns) are the better design.
One of the issues it raises with flat-wide tables (fewer rows, more columns) is:
In addition, HBase can only split at row boundaries, which also enforces the recommendation
to go with tall-narrow tables. Imagine you have all emails of a user in a single row. This
will work for the majority of users, but there will be outliers that will have magnitudes
of emails more in their inbox. So much so that a single row could outgrow the maximum file/region
size and work against the region split facility.
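
To make the two layouts in the excerpt concrete, here is a minimal sketch in plain Python (no
HBase client involved). The function names, the "data" column family, and the
"<userId>-<messageId>" row key format are assumptions chosen for illustration; they are not
taken from this thread. It only counts distinct row keys, since rows are the unit at which
a region can be split.

```python
# Illustrative only: cells are modelled as (row_key, column, value) tuples.
# All names here (flat_wide_cells, tall_narrow_cells, "data") are hypothetical.

def flat_wide_cells(user_id, emails):
    """One row per user; every email becomes another column in that row."""
    return [(user_id, "data:" + msg_id, body) for msg_id, body in emails]

def tall_narrow_cells(user_id, emails):
    """One row per email; the message id moves into the row key."""
    return [(user_id + "-" + msg_id, "data:body", body) for msg_id, body in emails]

emails = [("msg0001", "hello"), ("msg0002", "world"), ("msg0003", "again")]
flat = flat_wide_cells("user42", emails)
tall = tall_narrow_cells("user42", emails)

# Distinct row keys: a split can only fall between two of these.
flat_rows = {row for row, _, _ in flat}
tall_rows = {row for row, _, _ in tall}
print(len(flat_rows), len(tall_rows))
```

In the flat-wide layout all of a user's emails share one row key, so they can never be
separated by a split; in the tall-narrow layout each email has its own row key.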

So my question is: is it a bad idea to have a table like the one in the above example, where
emails are stored by adding columns? I have a similar table in my application, with a region
size of 1 GB and cell values of 10 KB. Will I run into the region-split issue mentioned above
once a single row holds about 100,000 columns (1 GB / 10 KB = 100,000)?
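
The estimate above can be reproduced with a few lines of Python. This is a rough sketch, not
a statement about how HBase measures region size: the function name and the
key_overhead_bytes parameter are hypothetical, and the 64-byte overhead is an arbitrary
placeholder for the row key, family, qualifier, and timestamp stored alongside each value.
Compression is ignored.

```python
# Back-of-the-envelope estimate: how many 10 KB cells fit in one 1 GB row?
# The function and its parameters are illustrative, not part of any HBase API.

def columns_until_row_fills_region(region_bytes, value_bytes, key_overhead_bytes=0):
    """Number of whole cells that fit before a single row reaches the region size."""
    return region_bytes // (value_bytes + key_overhead_bytes)

# Decimal units (1 GB = 10**9 bytes, 10 KB = 10**4 bytes), as in the message.
decimal_estimate = columns_until_row_fills_region(10**9, 10 * 10**3)

# Binary units (1 GB = 1024**3 bytes, 10 KB = 10 * 1024 bytes).
binary_estimate = columns_until_row_fills_region(1024**3, 10 * 1024)

# Same, with an assumed 64 bytes of per-cell key overhead (placeholder value).
with_overhead = columns_until_row_fills_region(1024**3, 10 * 1024, key_overhead_bytes=64)

print(decimal_estimate)  # 100000
print(binary_estimate)   # 104857
print(with_overhead < binary_estimate)  # True
```

Either way the figure is on the order of 100,000 columns, and any per-cell overhead only
lowers it.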



