hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Parent/child relation - go vertical, horizontal, or many tables?
Date Fri, 11 Feb 2011 20:22:30 GMT

David,

First, a caveat: you need a realistic notion of your data and its size when considering
your options.
With respect to my earlier response, here's what I said:
-=-
"With respect to your issue about a row being too large to fit in to memory... 
 This would imply that the row would be too large to fit in to a single 
region. Wouldn't that cause your HBase to die a horrible death?

 If this really is a potential situation, then you should consider the parent_key, child_id
compound row key..."
-=-
Now a correction: if you insert a row that is larger than a region, the region will grow to
fit the row and will not split, because a row never spans regions. So until your row exceeds
the size of available disk, you can do it. And yes, fetching such a row could fill up memory.

And that's the only reason why I would recommend option 2 over option 1.
So how real is this scenario? 

Looking at the 3 stated use cases: with option 1, a get() on the parent ID gives you the entire
set of children for the parent in a single fetch.
If you limit the columns to either a single column or a set of columns, you are still doing
a single get().

This is going to be faster than doing a scan() over a series of rows starting at parent_id
and stopping at parent_id+1.
(At least in theory. I haven't mocked this up and tried it.)
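
To make the option 1 access pattern concrete, here is a minimal sketch. It is a plain-Python toy model, not the HBase client API: the table is a dict of rows, each row is a dict of column qualifiers, and the names `table` and `get` are made up for illustration.

```python
# Toy model of option 1: one row per parent, one column qualifier per child.
# Plain Python only; this does not call HBase.
table = {
    "parent1": {"child_a": "v1", "child_b": "v2", "child_c": "v3"},
    "parent2": {"child_x": "v9"},
}

def get(table, row_key, columns=None):
    """Return the whole row, or only the requested column qualifiers."""
    row = table.get(row_key, {})
    if columns is None:
        return dict(row)
    return {q: row[q] for q in columns if q in row}

# Use case 1: all children of a parent, one lookup by row key.
all_children = get(table, "parent1")

# Use cases 2 and 3: a few children (or one) by child id, still one lookup.
some_children = get(table, "parent1", columns=["child_a", "child_c"])
```

In this model every use case is a single lookup by row key; the cost is that the whole set of children lives in one row.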

Again the only advantage of option 2 is if you really are worried about your data size blowing
you out of the water.
If you do find yourself using a lot of memory to fetch your edge cases, then you'd be better
off with the second option.

With option 2 you have the following:

1) Fetching all of the children (scan() with a start and stop key)
2) Fetching some of the rows (scan() with a start and stop key and some sort of filter)
3) Fetching a single child (get() using the combination of parent_id, child_id as the key)
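
The three access patterns above can be sketched with a toy model of sorted row keys. This is plain Python, not the HBase client API. The "/" delimiter, the key layout, and the helper names `stop_key_for` and `scan` are assumptions made for illustration.

```python
import bisect

# Toy model of option 2: one row per child, compound key parent_id/child_id.
# Keys are kept in sorted byte order, the way HBase orders row keys.
rows = {
    b"p1/c1": b"red",
    b"p1/c2": b"blue",
    b"p1/c3": b"red",
    b"p10/c1": b"green",
    b"p2/c1": b"red",
}
sorted_keys = sorted(rows)

def stop_key_for(prefix):
    """Smallest key sorting after every key with this prefix.

    Assumes the last byte of the prefix is below 0xff.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])

def scan(start, stop, value_filter=None):
    """Return (key, value) pairs with start <= key < stop, optionally filtered."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    hits = [(k, rows[k]) for k in sorted_keys[lo:hi]]
    if value_filter is not None:
        hits = [(k, v) for k, v in hits if value_filter(v)]
    return hits

prefix = b"p1/"
# 1) All children: range scan over the parent's key prefix.
all_children = scan(prefix, stop_key_for(prefix))
# 2) Some children: same range, plus a filter on the value.
red_children = scan(prefix, stop_key_for(prefix), lambda v: v == b"red")
# 3) One child: direct lookup by the full compound key.
one_child = rows.get(b"p1/c2")
```

Note that the delimiter is part of the prefix: scanning from b"p1" without it would also pick up the children of p10.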

So while you don't have to worry about the size of a row, you do not get the same performance
that you could with option 1.

Does that make sense?

-Mike





> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Fri, 11 Feb 2011 10:45:14 -0800
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> 
> Michael,
> Thanks for the analysis.  The thought process you put into this seems useful.  However,
> following along at home I came to a different conclusion than you did.  I would prefer (sol.
> 2) over (sol. 3) for the reason you mention, but I would also strongly prefer (sol. 2) over
> (sol. 1), also for the reason you mention.
> 
> So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1) would be
> very wasteful for use cases (u2) and (u3). The only time it would help is in (u1).  And then
> it doesn't seem obvious to me that a single row is better except in cases where there are
> very few children per parent.
> 
> Perhaps if the data is expected to have a power law distribution (fat tail, zipfian),
> a hybrid approach would be better: go with (sol. 1) for any parent that has fewer than (say)
> 10 children.  But, after a parent fills up its first 10 children, start populating rows like
> (sol. 2).
> 
> This would definitely make the client code more complex, so it would only make sense
> if there were huge savings to be had.
> Maybe a slightly better implementation of the hybrid would be to divide the child key
> space up into buckets so that you can directly address any child, but still have fewer calls
> in retrieving all children.  Then you can adjust your bucket size based on your actual use
> case (with a bucket size of 1 being the special case of (sol. 2)).
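
The bucketing idea in the quoted paragraph above could look roughly like the sketch below. It is one possible reading, not something specified in the thread: the CRC32 hash, the fixed bucket count `NUM_BUCKETS`, the "/" delimiter, and both function names are assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical tuning knob; more buckets means fewer children per row

def bucket_row_key(parent_id, child_id, num_buckets=NUM_BUCKETS):
    """Row key holding this child: parent id plus a bucket derived from the child id.

    The child itself would be stored as a column (qualifier = child_id) in that row,
    so a single child can be addressed directly without scanning.
    """
    bucket = zlib.crc32(child_id.encode("utf-8")) % num_buckets
    return f"{parent_id}/{bucket:04d}"

def all_bucket_row_keys(parent_id, num_buckets=NUM_BUCKETS):
    """Every row key that may hold children of this parent."""
    return [f"{parent_id}/{b:04d}" for b in range(num_buckets)]

# Direct addressing of one child: compute its row key, then read one column.
row_key = bucket_row_key("parent1", "child_42")

# Fetching all children: read at most NUM_BUCKETS rows instead of one row per child.
row_keys = all_bucket_row_keys("parent1")
```

The trade-off is the one raised in the quoted text: the client has to know the bucketing rule, and changing the bucket count later means rewriting existing rows.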
> 
> But the more I think about it, the more I suspect that the added complexity will not
> be worth it, and he should just go with (sol. 2).
> 
> Dave
> 
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Friday, February 11, 2011 5:51 AM
> To: user@hbase.apache.org
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> 
> 
> Jason,
> 
> You have the following constraint:
> For each child there is one parent. A parent can have more than one child.
> 
> While you don't specify the size of a child, when a parent can have tens of millions, that
> could become an issue.
> Assuming that the child is relatively small...
> 
> You have 3 use cases: (Scan patterns)
> 
> > -Fetch all children from a single parent
> > -Find a few children by their keys or values from a single parent
> > -Update a single child by child key and its parent key
> 
> Your options...
> 
> > 1. One table with one Parent per row. Row key is a parent id. 
> Children are stored in a single family each under separate qualifier 
> (child id). Would it even work assuming all children may not fit in 
> memory? 
> > 
> While you raise an interesting point, let's look at the schema as a solution.
> This works well because you can fetch the entire row based on parent key.
> So all queries are get()s and not scan()s.
> 
> You can then pull all of the existing columns where each column represents a child.
> 
> You can also do a get() of only those columns you want based on child_id as the column
> name.
> 
> You can also do a get() or a put() of a specific column (child_id) for a given parent (row
> key).
> 
> 
> With respect to your issue about a row being too large to fit in to memory... 
> This would imply that the row would be too large to fit in to a single region. Wouldn't
> that cause your HBase to die a horrible death?
> 
> If this really is a potential situation, then you should consider the parent_key, child_id
> compound row key...
> 
> > 2. One table. Compound row key parent id/child id. One child per row. 
> > 
> Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is
> only 'optimal' for your 'Update a single child by child key and its parent key'. 
> 
> > 3. Many tables - one per parent. Row key is a child id.
> If you have a scenario where a parent has billions+ of children, this could be a valid choice;
> however, based on what you said (up to tens of millions), and given that the data sets are
> unique and non-intersecting, you would be better off with a single table. (Too many tables
> is not a good thing in HBase.)
> 
> 
> HTH
> 
> -Mike
> 
> 
> > Subject: Parent/child relation - go vertical, horizontal, or many tables?
> > From: urgisb@gmail.com
> > Date: Thu, 10 Feb 2011 16:55:00 -0800
> > To: user@hbase.apache.org
> > 
> > Hi all,
> > 
> > Let's say I have two entities Parent and Child. There could be many children in
> > one parent (from hundreds to tens of millions)
> > A child can only belong to one Parent.
> > 
> > Typical queries are:
> > -Fetch all children from a single parent
> > -Find a few children by their keys or values from a single parent
> > -Update a single child by child key and its parent key
> > 
> > And there are no cross-parent queries.
> > 
> > I am trying to figure out which is the better schema approach from a performance/maintenance
> > perspective:
> > 
> > 1. One table with one Parent per row. Row key is a parent id. Children are stored
> > in a single family each under separate qualifier (child id). Would it even work assuming all
> > children may not fit in memory? 
> > 
> > 2. One table. Compound row key parent id/child id. One child per row. 
> > 
> > 3. Many tables - one per parent. Row key is a child id.
> > 
> > Thanks!