hbase-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE: Parent/child relation - go vertical, horizontal, or many tables?
Date Fri, 11 Feb 2011 18:45:14 GMT
Michael,
Thanks for the analysis.  The thought process you put into this seems useful.  However, following
along at home I came to a different conclusion than you did.  I would prefer (sol. 2) over
(sol. 3) for the reason you mention, but I would also strongly prefer (sol. 2) over (sol.
1), also for the reason you mention.

So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1) would be very
wasteful for use cases (u2) and (u3). The only time it would help is in (u1).  And then it
doesn't seem obvious to me that a single row is better except in cases where there are very
few children per parent.

Perhaps if the data is expected to have a power law distribution (fat tail, zipfian), a hybrid
approach would be better: go with (sol. 1) for any parent that has fewer than, say, 10 children.
But once a parent has filled up its first 10 children, start populating rows like (sol. 2).

This would definitely make the client code more complex, so it would only make sense if there
were huge savings to be had.
Maybe a slightly better implementation of the hybrid would be to divide the child key space
up into buckets so that you can directly address any child, but still have fewer calls in
retrieving all children.  Then you can adjust your bucket size based on your actual use case
(with a bucket size of 1 being the special case of (sol. 2)).
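A minimal sketch of the bucketed layout, in Python, might make this concrete. It assumes child ids are non-negative integers so that a bucket can be computed by integer division; the `/` separator, the 12-digit zero padding, and the names `bucket_cell` and `rows_for_parent` are illustrative assumptions, not anything from HBase itself.

```python
from typing import Tuple


def bucket_cell(parent_id: str, child_id: int, bucket_size: int) -> Tuple[bytes, bytes]:
    """Return the (row key, qualifier) that addresses one child directly."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    if child_id < 0:
        raise ValueError("child_id must be non-negative")
    bucket = child_id // bucket_size
    # Zero-pad so that keys sort numerically under byte-wise ordering.
    row_key = f"{parent_id}/{bucket:012d}".encode()
    qualifier = f"{child_id:012d}".encode()
    return row_key, qualifier


def rows_for_parent(child_count: int, bucket_size: int) -> int:
    """Rows to read to fetch all children, assuming ids run densely from 0."""
    return -(-child_count // bucket_size)  # ceiling division


if __name__ == "__main__":
    print(bucket_cell("p1", 1234, 100))
    print(rows_for_parent(1000, 100))
```

With `bucket_size=1` every child lands in its own row, which is (sol. 2); a very large bucket size approaches (sol. 1).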

But the more I think about it, the more I suspect that the added complexity will not be worth
it, and he should just go with (sol. 2).

Dave


-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Friday, February 11, 2011 5:51 AM
To: user@hbase.apache.org
Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?


Jason,

You have the following constraint:
Foreach child there is one parent. A parent can have more than one child.

While you don't specify the size of each child, when a parent can have tens of millions of
children, that could become an issue.
Assuming that each child is relatively small...

You have 3 use cases: (Scan patterns)

> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and its parent key

Your options...

> 1. One table with one Parent per row. Row key is a parent id.
> Children are stored in a single family each under separate qualifier
> (child id). Would it even work assuming all children may not fit in
> memory?
>
While you raise an interesting point, let's look at the schema as a solution.
This works well because you can fetch the entire row based on parent key.
So all queries are get()s and not scan()s.

You can then pull all of the existing columns where each column represents a child.

You can also do a get() of only those columns you want based on child_id as the column name.

You can also do a get() or a put of a specific column (child_id) for a given parent (row key).


With respect to your concern about a row being too large to fit into memory...
This would imply that the row would be too large to fit into a single region. Wouldn't that
cause your HBase to die a horrible death?
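A back-of-the-envelope sketch of that concern, in Python. The fixed per-cell overhead below assumes the KeyValue layout (length fields, timestamp, key type), and the key, qualifier, and value sizes are made-up numbers for illustration; it ignores compression and any block encoding.

```python
# Fixed bytes per cell, assuming the KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
KEYVALUE_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_bytes(row_key_len: int, family_len: int, qualifier_len: int, value_len: int) -> int:
    """Approximate uncompressed size of one cell (one child in the wide-row schema)."""
    return KEYVALUE_FIXED_OVERHEAD + row_key_len + family_len + qualifier_len + value_len


def wide_row_bytes(children: int, row_key_len: int, family_len: int,
                   qualifier_len: int, value_len: int) -> int:
    """Approximate size of one parent row holding every child as a column."""
    return children * cell_bytes(row_key_len, family_len, qualifier_len, value_len)


if __name__ == "__main__":
    # Hypothetical sizes: 16-byte parent key, 1-byte family name,
    # 8-byte child id as qualifier, 100-byte child value.
    size = wide_row_bytes(10_000_000, 16, 1, 8, 100)
    print(f"{size / 1e9:.2f} GB uncompressed for one parent row")
```

Since a row is never split across regions, comparing this estimate against your configured maximum region size tells you whether the one-row-per-parent schema is even an option for your largest parents.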

If this really is a potential situation, then you should consider the parent_key, child_id
compound row key...

> 2. One table. Compound row key parent id/child id. One child per row. 
> 
Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is only
'optimal' for your 'Update a single child by child key and its parent key'. 
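To illustrate the trade-off: with a compound key, addressing one child is a point lookup on the full key, while fetching all children of a parent becomes a range scan over the parent prefix. A minimal sketch in Python, using a sorted list as a stand-in for the table; the `/` separator and the helper names are illustrative assumptions, and it assumes parent ids never contain the separator.

```python
import bisect
from typing import List, Tuple

SEP = b"/"  # assumed separator; must not occur inside a parent id


def child_row_key(parent_id: bytes, child_id: bytes) -> bytes:
    """Compound row key: parent id, separator, child id."""
    return parent_id + SEP + child_id


def parent_scan_range(parent_id: bytes) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys covering one parent's children."""
    start = parent_id + SEP
    # Bump the last byte of the prefix to get the smallest key past the range.
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop


def scan(sorted_keys: List[bytes], start: bytes, stop: bytes) -> List[bytes]:
    """Range scan over a sorted key list, standing in for a table scan."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


if __name__ == "__main__":
    table = sorted(
        child_row_key(p, c)
        for p, c in [(b"p1", b"a"), (b"p1", b"b"), (b"p10", b"a"), (b"p2", b"a")]
    )
    print(scan(table, *parent_scan_range(b"p1")))
```

Because rows are stored sorted by key, the scan touches only the contiguous run of rows for that parent; rows for `p10` sort after the stop key and are excluded.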

> 3. Many tables - one per parent. Row key is a child id.
If you have a scenario where a parent has billions of children, this could be a valid choice.
However, based on what you said (up to tens of millions), and given that the data set is unique
and non-intersecting, you would be better off with a single table. (Too many tables is not a
good thing in HBase.)


HTH

-Mike


> Subject: Parent/child relation - go vertical, horizontal, or many tables?
> From: urgisb@gmail.com
> Date: Thu, 10 Feb 2011 16:55:00 -0800
> To: user@hbase.apache.org
> 
> Hi all,
> 
> Let's say I have two entities Parent and Child. There could be many children in one parent
> (from hundreds to tens of millions)
> A child can only belong to one Parent.
> 
> Typical queries are:
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and its parent key
> 
> And there are no cross-parent queries.
> 
> I am trying to figure out what is better schema approach from performance/maintenance
> perspective:
> 
> 1. One table with one Parent per row. Row key is a parent id. Children are stored in
> a single family each under separate qualifier (child id). Would it even work assuming all
> children may not fit in memory?
> 
> 2. One table. Compound row key parent id/child id. One child per row. 
> 
> 3. Many tables - one per parent. Row key is a child id.
> 
> Thanks!
 		 	   		  
