hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Embedded table data model
Date Fri, 13 Jul 2012 14:55:10 GMT
A caveat... Schema design in HBase is one of the hardest things to teach/learn because it's
so open-ended. There is more than one correct answer when it comes to creating a good design...

Ian's presentation tries to relate HBase schema design to relational modeling.
From past experience, I found that a bit confusing and somewhat limiting, because it
didn't let the student look beyond relational structures when thinking about the data.
(It's really hard to get ER modelers to make the transition.)

First, you can't think about HBase in terms of transactions. Transactional processing doesn't
exist in HBase, and what HBase offers in terms of row-level locking (RLL) isn't the same as
RLL in transactional processing.

There are also problems with the concept of column families. If the data set sizes are not
roughly equal, you end up with a lot of small files for one CF, because when a region splits,
all of its CFs split. So unless you have a good reason to share the same key across multiple
records, you really don't want to use CFs. (Or use them sparingly.)

Note, I said records. 
That is because you need to think of your row of data as a self contained record. 
Think of when you go to your doctor's office and they pull out a hard copy of your medical records.
That folder (in my case, a thick folder... ;-) ) contains your entire patient medical history.

That folder is analogous to an HBase record/row. As you can see, you end up tossing
the relational model out the window.

An example, in terms of a PoS/customer order entry system:

You could consider having a record of a customer's order. 
Then you can have one column family for the customer's relatively static information like
contacts, phone#, addresses, etc... 
One column family for Orders
One column family for Invoices
One column family for Pick Slips

All based on a composite key of your customer_id and then order_num.
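A minimal sketch of how such a composite key could be built (this is a plain-Python model, not HBase API; the field widths are hypothetical). Fixed-width big-endian encoding keeps HBase's lexicographic row-key sort consistent with numeric order, so all of a customer's orders cluster together:

```python
def composite_key(customer_id: int, order_num: int) -> bytes:
    """Fixed-width big-endian fields: byte-wise (lexicographic) sort
    then matches numeric sort, customer first, order number second."""
    return customer_id.to_bytes(8, "big") + order_num.to_bytes(8, "big")

# All of customer 42's orders sort together, ordered by order number:
keys = [composite_key(42, 7), composite_key(42, 2), composite_key(41, 9)]
assert sorted(keys) == [composite_key(41, 9),
                        composite_key(42, 2),
                        composite_key(42, 7)]
```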

Since the column data is a byte array (everything in HBase is a byte array), the data stored
in a column could be a primitive data type or some more complex structure.
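To make that concrete, here is one hypothetical way to flatten either a primitive or a nested structure into an opaque byte array (JSON chosen only for illustration; any serialization works, since HBase itself only ever sees bytes):

```python
import json

def encode_value(value) -> bytes:
    # A primitive or a nested structure both end up as an opaque byte
    # array; HBase stores and returns bytes, nothing more.
    return json.dumps(value, sort_keys=True).encode("utf-8")

assert encode_value(12.5) == b"12.5"
assert encode_value({"qty": 3, "sku": "A-100"}) == b'{"qty": 3, "sku": "A-100"}'
```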

In this example, I used CFs because these are records that are tied together by the same
key but serve different purposes.
When you place an order, you generate one or more pick slips. 
You may also generate one or more invoices associated to that order or multiple orders if
you allow customers to have a consolidated account and bill monthly. 

A lot of the design depends on your data and its primary use case. 

As always YMMV.

On Jul 12, 2012, at 10:08 PM, Ian Varley wrote:

> Column families are not the same thing as columns. You should indeed have a small number
of column families, as that article points out. Columns (aka column qualifiers) are run-time
defined key/value pairs that contain the data for every row, and having large numbers of these
is fine. 
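The distinction can be sketched with a toy in-memory model of a row (family and qualifier names here are made up): families are few and fixed at table creation, while qualifiers are created freely at write time, and having very many of them is fine:

```python
# Toy model of one HBase row: family -> {qualifier -> value}.
row = {"cf": {}}   # a single column family, declared up front

for txn_id in range(10_000):              # 10K runtime-defined columns
    row["cf"][f"txn_{txn_id:05d}"] = b"..."

assert len(row) == 1            # column families stay few
assert len(row["cf"]) == 10_000 # column qualifiers can be many
```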
> On Jul 12, 2012, at 7:27 PM, "Cole" <heshuai64@gmail.com> wrote:
>> I think this design raises some questions; please refer to
>> http://hbase.apache.org/book/number.of.cfs.html
>> 2012/7/12 Ian Varley <ivarley@salesforce.com>
>>> Yes, that's fine; you can always do a single column PUT into an existing
>>> row, in a concurrency-safe way, and the lock on the row is only held as
>>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>>> architecture, that's efficient because the PUT only goes to memory, and is
>>> merged with on-disk records at read time (until a regular flush or
>>> compaction happens).
>>> So even though you already have, say, 10K transactions in the table, it's
>>> still efficient to PUT a single new transaction in (whether that's in the
>>> middle of the sorted list of columns, at the end, etc.)
>>> Ian
>>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>>> but they are other writers insert new transactions into the table when
>>> customers do new transactions.
>>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <ivarley@salesforce.com
>>> <mailto:ivarley@salesforce.com>> wrote:
>>> Hi Xiaobo -
>>> For HBase, this is doable; you could have a single table in HBase where
>>> each row is a customer (with the customerid as the rowkey), and columns for
>>> each of the 300 attributes that are directly part of the customer entity.
>>> This is sparse, so you'd only take up space for the attributes that
>>> actually exist for each customer.
>>> You could then have (possibly in another column family, but not
>>> necessarily) an additional column for each transaction, where the column
>>> name is composed of a date concatenated with the transaction id, in which
>>> you store the 30 attributes as serialized into a single byte array in the
>>> cell value. (Or, you could alternately do each attribute as its own column
>>> but there's no advantage to doing so, since presumably a transaction is
>>> roughly like an immutable event that you wouldn't typically change just a
>>> single attribute of.) A schema for this (if spelled out in an xml
>>> representation) could be:
>>> <table name="customer">
>>> <key>
>>> <column name="customerid">
>>> </key>
>>> <columnfamily name="1">
>>> <column name="customer_attribute_1" />
>>> <column name="customer_attribute_2" />
>>> ...
>>> <column name="customer_attribute_300" />
>>> </columnFamily>
>>> <columnFamily name="2">
>>> <entity name="transaction" values="serialized">
>>>   <key>
>>>     <column name="transaction_date" type="date">
>>>     <column name="transaction_id" />
>>>   </key>
>>>   <column name="transaction_attribute_1" />
>>>   <column name="transaction_attribute_2" />
>>>   ...
>>>   <column name="transaction_attribute_30" />
>>> </entity>
>>> </columnFamily>
>>> </table>
>>> (This isn't real HBase syntax, it's just an abstract way to show you the
>>> structure.) In practice, HBase isn't doing anything "special" with the
>>> entity that lives nested inside your table; it's just a matter of
>>> convention, that you could "see" it that way. The customer-level attributes
>>> (like, say, "customer_name" and "customer_address") would be literal column
>>> names (aka column qualifiers) embedded in your code, whereas the
>>> transaction-oriented columns would be created at runtime with column names
>>> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
>>> objects (containing the 30 attributes) serialized into a byte array.
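A small sketch of that column-naming scheme (plain Python, not HBase API; attribute names are hypothetical). Putting the date first means the qualifiers sort chronologically under HBase's lexicographic column ordering:

```python
import json
from datetime import datetime

def txn_column(ts: datetime, txn_id: str) -> str:
    # Date-first qualifiers sort chronologically, because HBase keeps
    # column qualifiers in lexicographic (byte) order.
    return f"{ts:%Y-%m-%d %H:%M:%S}_{txn_id}"

qualifier = txn_column(datetime(2012, 7, 11, 12, 34, 56), "TXN12345")
assert qualifier == "2012-07-11 12:34:56_TXN12345"

# The ~30 transaction attributes travel as one serialized cell value:
value = json.dumps({"amount": 99.95, "currency": "USD"}).encode("utf-8")
```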
>>> In this scenario, you get fast access to any customer by ID, and further
>>> to a range of transactions by date (using, say, a column pagination
>>> filter). This would perform roughly equivalently regardless of how many
>>> customers are in the table, or how many transactions exist for each
>>> customer. What you'd lose on this design would be the ability to get a
>>> single transaction for a single customer by ID (since you're storing them
>>> by date). But if you need that, you could actually store it both ways. You
>>> also might be introducing some extra contention on concurrent transaction
>>> PUT requests for a single client, because they'd have to fight over a lock
>>> for the row (but that's probably not a big deal, since it's only
>>> contentious within each customer).
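The read path this enables can be modeled in a few lines (a pure-Python stand-in for a server-side column filter; the qualifiers are invented): because qualifiers are date-prefixed and kept sorted, a time window reduces to a lexicographic range over the column names:

```python
from bisect import bisect_left, bisect_right

# Stand-in for one row's sorted transaction columns:
qualifiers = sorted([
    "2012-07-09 08:00:00_TXN1", "2012-07-10 09:30:00_TXN2",
    "2012-07-11 12:34:56_TXN3", "2012-07-12 01:00:00_TXN4",
])

def window(quals, start, stop):
    # Lexicographic range scan over sorted qualifiers, the same idea a
    # server-side column range filter applies.
    return quals[bisect_left(quals, start):bisect_right(quals, stop)]

hits = window(qualifiers, "2012-07-10", "2012-07-12")
assert hits == ["2012-07-10 09:30:00_TXN2", "2012-07-11 12:34:56_TXN3"]
```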
>>> You might find my presentation on designing HBase schemas (from this
>>> year's HBaseCon) useful:
>>> http://www.hbasecon.com/sessions/hbase-schema-design-2/
>>> Ian
>>> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
>>> Hi,
>>> I have a technical problem, and wonder whether HBase or Cassandra
>>> support an embedded table data model, or can somebody show me a way
>>> to do this:
>>> 1. We have a very large customer entity table with 100 million
>>> rows; each customer row has about 300 attributes (columns).
>>> 2. Each customer does about 1000 transactions per year, each
>>> transaction has about 30 attributes (columns), and we keep only one
>>> year of transactions for each customer.
>>> We want a data model where we can get the customer entity, with all
>>> the transactions he did within a fixed time window, in a single
>>> client call, according to the customer id (which is the primary key
>>> of the customer table). We do the following in an RDBMS:
>>> a customer table with customerid as the primary key, a transaction
>>> table with customer id as a secondary index, and a join between
>>> them; or we must do two separate calls. Because we have so many
>>> concurrent readers, and these two tables have become so large, the
>>> RDBMS performs poorly.
>>> Can we embed the transactions inside the customer table in HBase or
>>> Cassandra?
>>> Regards,
>>> Xiaobo Gu
