hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Gray <jg...@fb.com>
Subject RE: Schema design, one-to-many question
Date Tue, 30 Nov 2010 16:11:14 GMT
I'm not sure I agree that "you can not think of relationships".

There is in fact a one-to-many relationship in the described schema, regardless of the database
you are modeling it in.  I'm not sure why you would not want to think about these relations?

And a row being userid with a family containing a column per order seems like a very legitimate
way to store a one-to-many relationship.  However it's not exactly as you've described (primary
key == foreign key).  One primary key is userid, the other is orderid (or if orderid is only
unique per user, the second primary key is userid+orderid).  Relational thinking would be
three tables (users, orders, userorders), HBase thinking would be one table keyed on user.

What you need to "drop" from the relational model is the idea of breaking these things up
into different tables to join them later.  With column-orientation, you can have the user
as the row and stuff all of his relations into that same row.

JG

> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Tuesday, November 30, 2010 5:32 AM
> To: user@hbase.apache.org
> Subject: RE: Schema design, one-to-many question
> 
> 
> I'm sorry if this has already been answered, but I'll share my $0.02
> anyway...
> 
> 
> First, you and everyone have to stop thinking of hbase in terms of a
> relational model. Because Hbase doesn't have the concept of joins, you can
> not think of relationships.
> If you have two tables where the primary key of both tables are the same
> (primary key == foreign key) then you can put the two relational tables
> within the same table but in different column families.
> 
> Having said that... usually in this case you would store them in the same
> column family, however there may be a valid reason to separate them. (Row
> size and access patterns may make it a good idea to separate them out for
> performance reasons).
> 
> In your example, which is a classic example... don't think in terms of
> relational but think in terms of a hierarchy data base. Think in terms of
> Dick Pick.
> (See: http://en.wikipedia.org/wiki/Pick_operating_system )
> 
> Pick or Pick like systems were Revelation (Now I'm showing my age...) , U2
> aka Universe (VMark was acquired by Informix, then IBM, then spun off
> recently... I think) is a good example of how to model within HBase.
> 
> So if we look at your example... how do you plan on accessing the
> information?
> Since everyone likes to talk 'agile' ... think about your story lines....
> "A grasshopper walks in to a bar ... " (Sorry bad joke).
> In your case... A customer logs in to your website. Starts to place an
> order....
> When the customer logs in... you know his customer id. So you can keep the
> customer information in a separate table since you're not looking up the
> data immediately. You can then use the customer id as part of the key for
> your order table. In fact, I'd make it the first part of the composite key
> since the customer may not always remember their order numbers and you will
> want to search the order table (yes, its clear that you want a table for
> your orders too.) So your key could be customer id + order num.
> 
> Then when a customer wants to find his orders, you can do a scan with a
> start key and end key based on customer id.
> 
> Beyond this... there are strategies on how you create your customer id to
> get better utilization of your cloud.
> 
> NOTE THE FOLLOWING:
> This example and design does not use indexing. If you want 'real time'
> performance, you'll need to incorporate indexing in to your design, however
> that's another story...
> 
> HTH
> 
> -Mike
> 
> 
> > Date: Mon, 29 Nov 2010 13:40:58 -0800
> > Subject: Schema design, one-to-many question
> > From: bryanck@gmail.com
> > To: user@hbase.apache.org
> >
> > I have read comments on modeling one-to-many relationships in HBase and
> > wanted to get some feedback. I have millions of customers, and each
> customer
> > can make zero to thousands of orders. I want to store all of this data in
> > HBase. The data is always accessed by customer.
> >
> > It seems there are a few schema design approaches.
> >
> > Approach 1: Orders table. One row per order. Customer data is either
> > denormalized, or the customer ID is stored for lookup in a customer data
> > cache. Table will have billions of rows of a few columns each.
> >
> > key: customer ID + order ID
> > family 1: customer (customer:id)
> > family 2: order (order:id, order:amount, order:date, etc.)
> >
> > Approach 2: Customer table. One row per customer. All orders are stored
> in a
> > column family with order ID in the column name. Millions of rows with
> > potentially thousands of columns each.
> >
> > key: customer ID
> > family 1: customer (customer:id, customer:name, customer:city, etc.)
> > family 2: order (order:id_<id of order>, order:amount_<id of order>,
> > order:date_<id of order>)
> >
> > Approach 3: Same as #2, but store the order data as a serialized blob
> > instead of in separate columns:
> >
> > key: customer ID
> > family 1: customer (customer:id, customer:name, customer:city, etc.)
> > family 2: order (order:<id of order>)
> >
> > Approach 4: Not sure if this is viable, but same as #2 but use versions
> in
> > the order family to store multiple orders.
> >
> > key: customer ID
> > family 1: customer (customer:idm customer:name, customer:city, etc.)
> > family 2: order (order:id, order:amount, order:date, etc.) - 1000
> versions
> >
> > I am thinking approach #1 is probably the correct approach, but #2 and #3
> > (and #4?) would be more efficient from an application standpoint, as
> > everything is processed by customer and I won't need a customer data
> cache
> > or worry about updating denormalized data. Does anyone have feedback as
> to
> > what approaches work for them for data sets like this, and why?
> 

Mime
View raw message