hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiaobo Gu <guxiaobo1...@gmail.com>
Subject Re: Embedded table data model
Date Thu, 12 Jul 2012 06:27:07 GMT
but they are other writers insert new transactions into the table when
customers do new transactions.

On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <ivarley@salesforce.com> wrote:
> Hi Xiaobo -
> For HBase, this is doable; you could have a single table in HBase where each row is a
customer (with the customerid as the rowkey), and columns for each of the 300 attributes that
are directly part of the customer entity. This is sparse, so you'd only take up space for
the attributes that actually exist for each customer.
> You could then have (possibly in another column family, but not necessarily) an additional
column for each transaction, where the column name is composed of a date concatenated with
the transaction id, in which you store the 30 attributes as serialized into a single byte
array in the cell value. (Or, you could alternately do each attribute as its own column but
there's no advantage to doing so, since presumably a transaction is roughly like an immutable
event that you wouldn't typically change just a single attribute of.) A schema for this (if
spelled out in an xml representation) could be:
> <table name="customer">
>   <key>
>     <column name="customerid">
>   </key>
>   <columnfamily name="1">
>     <column name="customer_attribute_1" />
>     <column name="customer_attribute_2" />
>     ...
>     <column name="customer_attribute_300" />
>   </columnFamily>
>   <columnFamily name="2">
>     <entity name="transaction" values="serialized">
>       <key>
>         <column name="transaction_date" type="date">
>         <column name="transaction_id" />
>       </key>
>       <column name="transaction_attribute_1" />
>       <column name="transaction_attribute_2" />
>       ...
>       <column name="transaction_attribute_30" />
>     </entity>
>   </columnFamily>
> </table>
> (This isn't real HBase syntax, it's just an abstract way to show you the structure.)
In practice, HBase isn't doing anything "special" with the entity that lives nested inside
your table; it's just a matter of convention, that you could "see" it that way. The customer-level
attributes (like, say, "customer_name" and "customer_address") would be literal column names
(aka column qualifiers) embedded in your code, whereas the transaction-oriented columns would
be created at runtime with column names like "2012-07-11 12:34:56_TXN12345", and values that
are simply collection objects (containing the 30 attributes) serialized into a byte array.
> In this scenario, you get fast access to any customer by ID, and further to a range of
transactions by date (using, say, a column pagination filter). This would perform roughly
equivalently regardless of how many customers are in the table, or how many transactions exist
for each customer. What you'd lose on this design would be the ability to get a single transaction
for a single customer by ID (since you're storing them by date). But if you need that, you
could actually store it both ways. You also might be introducing some extra contention on
concurrent transaction PUT requests for a single client, because they'd have to fight over
a lock for the row (but that's probably not a big deal, since it's only contentious within
each customer).
> You might find my presentation on designing HBase schemas (from this year's HBaseCon)
> http://www.hbasecon.com/sessions/hbase-schema-design-2/
> Ian
> On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:
> Hi,
> I have technical problem, and wander whether HBase or Cassandra
> support Embedded table data model, or can somebody show me a way to do
> this:
> 1.We have a very large customer entity table which have 100 milliion
> rows, each customer row has about 300 attributes(columns).
> 2.Each customer do about 1000 transactions per year, each transaction
> has about 30 attributes(columns), and we just save one year
> transactions for each customer
> We want a data model that  we can get the customer entity with all the
> transactions which he did for a single client call within a fixed time
> window, according to the customer id (which is the primary key of the
> customer table). We do the following in RDBMS,
> A customer table with customerid as the primary key, A transaction
> table with customer id as a secondary index, and join them , or we
> must do two separate  calls, and because we have so many concurrent
> readers and these two tables are became so large, the RDBMS system
> performs poor.
> Can we embedded the transactions inside the customer table in HBase or
> Cassandra?
> Regards,
> Xiaobo Gu

View raw message