hbase-user mailing list archives

From Jonathan Gray <jg...@fb.com>
Subject RE: Schema design, one-to-many question
Date Tue, 30 Nov 2010 00:14:48 GMT
Hey Bryan,

All of these approaches could work and seem sane.

My preference these days would be the wide-table approach (#2, 3, 4) rather than the tall
table.  Previously #1 was more efficient but in 0.90 and beyond the same optimizations exist
for both tall and wide tables.

For #2, I would probably structure the qualifier as <id_of_order>_fieldname (rather
than the other way around).  Then the fields for a given order are contiguous (rather than
grouped by fieldname).
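A quick way to see the difference in sort order, sketched in plain Java (the order IDs and field names here are hypothetical; HBase actually compares qualifiers as byte arrays, which for ASCII matches lexicographic string order):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class QualifierOrder {
    public static void main(String[] args) {
        // Qualifiers as <id_of_order>_fieldname: each order's fields sort together.
        TreeSet<String> byOrder = new TreeSet<>(Arrays.asList(
            "1001_amount", "1001_date", "1002_amount", "1002_date"));
        System.out.println(byOrder);
        // -> [1001_amount, 1001_date, 1002_amount, 1002_date]

        // Qualifiers as fieldname_<id_of_order>: one order's fields are scattered.
        TreeSet<String> byField = new TreeSet<>(Arrays.asList(
            "amount_1001", "date_1001", "amount_1002", "date_1002"));
        System.out.println(byField);
        // -> [amount_1001, amount_1002, date_1001, date_1002]
    }
}
```

One caveat if you go this route: variable-length numeric IDs should be zero-padded (or otherwise fixed-width encoded), since "999_amount" sorts after "1001_amount" lexicographically.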

If you have some existing serialization method you are using in your application, #3 would
make sense.

#4 wouldn't be ideal because HBase sorts on column before version, so the fields for a given
order would not be contiguous, and reads would thus be inefficient.  This is similar to the
qualifier-ordering issue in #2.
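That cell ordering can be sketched with a plain Java comparator (a simplified model of HBase's cell ordering, not the real implementation): cells sort by qualifier ascending, then by timestamp descending, so all versions of order:amount cluster together rather than next to the other fields of their order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class VersionOrder {
    // Simplified model of an HBase cell within one row/family:
    // a column qualifier plus a version timestamp.
    record Cell(String qualifier, long ts) {}

    public static void main(String[] args) {
        // Approach #4: each order is one "version" of the order:* columns.
        List<Cell> cells = new ArrayList<>(Arrays.asList(
            new Cell("amount", 3), new Cell("date", 3),   // order written at t=3
            new Cell("amount", 2), new Cell("date", 2),   // order written at t=2
            new Cell("amount", 1), new Cell("date", 1))); // order written at t=1

        // Sort by qualifier first, then timestamp descending (newest version first).
        cells.sort(Comparator.comparing(Cell::qualifier)
            .thenComparing(Comparator.comparingLong(Cell::ts).reversed()));
        System.out.println(cells);
        // All "amount" versions come before any "date" version, so the fields of
        // one order (one timestamp) end up interleaved, not contiguous.
    }
}
```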

The most important thing is to design this so you have efficient reads.  I imagine one of
the important queries is something like "get me all the info for this order".  If so, it would
be important that all fields for an order are together.


> -----Original Message-----
> From: Bryan Keller [mailto:bryanck@gmail.com]
> Sent: Monday, November 29, 2010 1:41 PM
> To: user@hbase.apache.org
> Subject: Schema design, one-to-many question
> I have read comments on modeling one-to-many relationships in HBase and
> wanted to get some feedback. I have millions of customers, and each
> customer
> can make zero to thousands of orders. I want to store all of this data in
> HBase. The data is always accessed by customer.
> It seems there are a few schema design approaches.
> Approach 1: Orders table. One row per order. Customer data is either
> denormalized, or the customer ID is stored for lookup in a customer data
> cache. Table will have billions of rows of a few columns each.
> key: customer ID + order ID
> family 1: customer (customer:id)
> family 2: order (order:id, order:amount, order:date, etc.)
> Approach 2: Customer table. One row per customer. All orders are stored in
> a
> column family with order ID in the column name. Millions of rows with
> potentially thousands of columns each.
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:id_<id of order>, order:amount_<id of order>,
> order:date_<id of order>)
> Approach 3: Same as #2, but store the order data as a serialized blob
> instead of in separate columns:
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:<id of order>)
> Approach 4: Not sure if this is viable, but same as #2 but use versions in
> the order family to store multiple orders.
> key: customer ID
> family 1: customer (customer:id, customer:name, customer:city, etc.)
> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
> I am thinking approach #1 is probably the correct approach, but #2 and #3
> (and #4?) would be more efficient from an application standpoint, as
> everything is processed by customer and I won't need a customer data cache
> or worry about updating denormalized data. Does anyone have feedback as to
> what approaches work for them for data sets like this, and why?
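For approach #1 above, the composite row key is what keeps a customer's orders together. A minimal sketch (hypothetical IDs, zero-padded so lexicographic byte order matches numeric order):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class RowKeyOrder {
    public static void main(String[] args) {
        // Approach #1: row key = customer ID + order ID.
        TreeSet<String> rowKeys = new TreeSet<>(Arrays.asList(
            "cust00041-order0009",
            "cust00042-order0003",
            "cust00042-order0001",
            "cust00042-order0002"));
        // All of customer 42's orders are adjacent in the table, so a scan
        // starting at the prefix "cust00042-" reads them as one contiguous range.
        System.out.println(rowKeys.tailSet("cust00042-"));
        // -> [cust00042-order0001, cust00042-order0002, cust00042-order0003]
    }
}
```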
