hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: When to expand vertically vs. horizontally in Hbase
Date Fri, 05 Jul 2013 18:26:05 GMT
Mike and I get into good discussions about ERD modeling and HBase a lot ... :)

Mike's right that you should avoid a design that relies heavily on relationships when modeling
data in HBase, because relationships are tricky (they're the first thing that gets thrown out
the window in a database that can scale to huge data sets, because enforcing them is more
trouble than it's worth, as is supporting normalization, joins, etc.). If you start with a traditional
ERD, you're more likely to fall into this trap, because you're "used to" normalizing the crap
out of your entities.

But, something just occurred to me: just because your physical implementation (HBase) doesn't
support normalized entities and relationships doesn't mean your *problem* doesn't have entities
and relationships. :) An Author is one entity, a Title is another, and a Genre is a third.
Understanding how they interact is a prerequisite for translating into a physical model that
works well in HBase. (ERD modeling is not categorically the only way to understand that, but
I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in
your head").

Once you understand what your entities really are, and how they relate to each other, you
have pretty limited choices for how to represent multiple independent entities in HBase:

1) In unrelated tables. You just put authors in one table, titles in another, and genres in
a third. You do all the work of joining and maintaining cross-entity integrity yourself (if
needed). This is the default mode in HBase: "you worry about it". And that works great in
many simple cases. This is appropriate if your "hard problem" is scaling a small set of simple
entities to massive size, and you can take the hit for the application complexity that follows.
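
As a rough illustration of what "you worry about it" means in option 1, here is a minimal sketch that uses in-memory sorted maps as stand-ins for two unrelated tables. Nothing here is the HBase client API; the class name, the field names, and the choice to store the author ID inside each title row are assumptions made only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for two unrelated tables; TreeMap mimics sorted row keys.
// Hypothetical names throughout; this is not HBase client code.
class UnrelatedTablesSketch {
    // "authors" table: row key = author ID, value = author name
    final Map<String, String> authors = new TreeMap<>();
    // "titles" table: row key = title ID, value = {title name, author ID}
    final Map<String, String[]> titles = new TreeMap<>();

    // The "join" the application has to do itself: scan every title row
    // and keep the ones that point at the requested author.
    List<String> titlesByAuthor(String authorId) {
        List<String> result = new ArrayList<>();
        if (!authors.containsKey(authorId)) {
            return result; // cross-entity integrity is also the application's job
        }
        for (String[] title : titles.values()) {
            if (title[1].equals(authorId)) {
                result.add(title[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        UnrelatedTablesSketch db = new UnrelatedTablesSketch();
        db.authors.put("a1", "Some Author");
        db.titles.put("1234", new String[] {"First Title", "a1"});
        db.titles.put("5678", new String[] {"Second Title", "a1"});
        System.out.println(db.titlesByAuthor("a1"));
    }
}
```

The full scan inside titlesByAuthor is the application complexity mentioned above: nothing in the storage layer knows that titles belong to authors.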

2) Scrunched into one table. You figure out the most important entity, and make that *the*
table, with all other data stuffed into it. In simple cases, this could be columns that hold
JSON; in advanced cases, you could use many columns to "nest" other entities in an intra-row
version of denormalization. For example, have the row key of the HBase table be something
like "Author ID", and then have a repeating column series for their titles, with column names
like "title:1234", "title:5678", etc. This isn't a very common model, because you have to
jump through some hoops in HBase (e.g. in this model, the way you would scan over authors
differs from how you'd "scan over" titles for an author or across authors). The only real
advantage to this over other forms of denormalization is that HBase guarantees intra-row ACID
properties, so you're guaranteed to get all or none of the updates to the row (i.e. you don't
have to reason about the failure cases). This can (but does *not* have to) use different column
families for the different "entities" inside the row.
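
To make the "different kinds of scan" point in option 2 concrete, here is a small sketch that models one table as nested sorted maps (row key, then column qualifier). It is only an in-memory model, not HBase client code; the class and method names are hypothetical, and the "title:" qualifier prefix is taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of a single table keyed by author ID, with titles nested
// as "title:<id>" qualifiers inside each row. Hypothetical names throughout.
class NestedRowSketch {
    // row key (author ID) -> sorted column qualifiers -> value
    final TreeMap<String, TreeMap<String, String>> authorTable = new TreeMap<>();

    void put(String authorId, String qualifier, String value) {
        authorTable.computeIfAbsent(authorId, k -> new TreeMap<>()).put(qualifier, value);
    }

    // Scanning authors is an ordinary walk over row keys.
    List<String> authorIds() {
        return new ArrayList<>(authorTable.keySet());
    }

    // "Scanning" titles for one author is a qualifier-prefix walk inside one row.
    List<String> titlesOf(String authorId) {
        List<String> out = new ArrayList<>();
        TreeMap<String, String> row = authorTable.get(authorId);
        if (row == null) {
            return out;
        }
        for (Map.Entry<String, String> cell : row.tailMap("title:").entrySet()) {
            if (!cell.getKey().startsWith("title:")) {
                break; // walked past the last title qualifier
            }
            out.add(cell.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        NestedRowSketch table = new NestedRowSketch();
        table.put("a1", "name", "Some Author");
        table.put("a1", "title:1234", "First Title");
        table.put("a1", "title:5678", "Second Title");
        System.out.println(table.authorIds());
        System.out.println(table.titlesOf("a1"));
    }
}
```

The two read paths (authorIds and titlesOf) use different mechanisms, which is the hoop-jumping described above; scanning titles across authors would need yet another loop over every row.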

3) Denormalized across many tables. When you write to HBase, you write in multiple layouts:
the Author table also contains a list of their titles, the Title table has author name &
other info, etc. This basically equates to doing extra work at write time so you don't have
to write code that does arbitrary joins and index usage at read time; in exchange, you get
slower and more complex writes, but faster and simpler reads from different access paths.
(It's still quite tricky, because you have to handle failure cases--what if one table gets
written but the other doesn't?)
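
The failure case in option 3 can be sketched with two in-memory maps standing in for the Author and Title tables. This is not HBase client code; the class name, the failSecondWrite switch (a stand-in for a crash between the two writes), and the consistency check are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of writing the same fact in two layouts.
// Hypothetical names throughout; no real HBase calls are made.
class DualWriteSketch {
    // Author table: row key = author ID, columns "title:<id>" -> title name
    final Map<String, Map<String, String>> authorTable = new HashMap<>();
    // Title table: row key = title ID, columns "name" and "author"
    final Map<String, Map<String, String>> titleTable = new HashMap<>();
    // Simulates a crash after the first write but before the second.
    boolean failSecondWrite = false;

    void addTitle(String authorId, String titleId, String name) {
        authorTable.computeIfAbsent(authorId, k -> new TreeMap<>())
                   .put("title:" + titleId, name);
        if (failSecondWrite) {
            throw new IllegalStateException("simulated failure before second write");
        }
        Map<String, String> row = titleTable.computeIfAbsent(titleId, k -> new TreeMap<>());
        row.put("name", name);
        row.put("author", authorId);
    }

    // True when every title listed under an author also exists in the title table.
    boolean isConsistent() {
        for (Map<String, String> row : authorTable.values()) {
            for (String qualifier : row.keySet()) {
                if (qualifier.startsWith("title:")
                        && !titleTable.containsKey(qualifier.substring("title:".length()))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DualWriteSketch db = new DualWriteSketch();
        db.addTitle("a1", "1234", "First Title");
        System.out.println("consistent after normal write: " + db.isConsistent());
        db.failSecondWrite = true;
        try {
            db.addTitle("a1", "5678", "Second Title");
        } catch (IllegalStateException e) {
            System.out.println("consistent after failed write: " + db.isConsistent());
        }
    }
}
```

After the simulated failure, the Author table lists a title that the Title table has never heard of, so some repair or retry logic has to live in the application.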

4) Normalized, with help from custom coprocessors. You could write your own suite of coprocessors
to automatically do database-like things for you, such as joins and secondary indexing. I
wouldn't recommend this route unless you're doing them in a general enough way to share. For
example, Phoenix has an aggregation component that's built as a coprocessor and works really
well; and it's applicable to anyone who wants to use Phoenix. You could build more stuff on
this SQL framework, like indexes and joins and cascaded relationships and stuff. But that's
a pretty massive undertaking for a single use case. :)

Maybe there are others I'm not thinking of, but I think these are basically your only choices.
Mike, can you think of other basic approaches to representing more than one entity in HBase
(where entity is defined as some repeating element in your data storage where individual instances
are uniquely identifiable, possibly with one or more additional attributes)?


On Jul 5, 2013, at 12:48 PM, Michael Segel wrote:

Sorry, but you missed the point.

(Note: This is why I keep trying to put a talk at Strata and the other conferences on Schema
design yet for some reason... it just doesn't seem important enough or sexy enough... maybe
if I worked for Cloudera/Intel/etc ...  ;-)


The issue is what column families are and how to use them.

Since each column family is stored in its own HFiles but uses the same row key, the question
is why you need one and when you want to use it.

The answer unfortunately is a bit more complicated than the questions.

You have to ask yourself when do you have a series of tables which have the same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions is a start.

Like I said, think about the order entry example and how the data is used in those column
families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last
part, but it's a very important concept. You need to stop thinking in terms of ERD when there
is no relationship. Column families tend to create a weak relationship... which makes them
a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <aji1705@gmail.com> wrote:

I understand that there shouldn't be unlimited number of column families. I
am using this example on purpose to see how it comes into play.

On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <michael_segel@hotmail.com> wrote:

Why do you have so many column families (CF) ?

It's not a question of physical limitations, but more of an issue of
data design.

There aren't that many really good examples of where you would have
multiple column families that would require more than a handful of CFs.

When I teach or lecture, the example I use is an order entry system,
where you would have the same key on order entry, pick slips, shipping,
and invoice.

That's probably the best example of where CFs come into play.

I'd suggest that you go back and rethink the design if you're having more
than a handful.

On Jul 5, 2013, at 8:53 AM, Aji Janis <aji1705@gmail.com> wrote:


I am using the Genre/Author stuff as an example but yes at the moment I
only have 5 column families. However, over time I may have more (no upper
limit decided at this point). See below for more responses.

On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:

Do you have only 5 static author names?
Keep in mind the column family name is defined when creating the table.

Regarding tall vs wide debate:
HBase is first and foremost a key-value database, thus it reads and writes
at the column-value level. So it doesn't really care about rows.
But that's not entirely true. Rows come into play in the following cases:
Splitting a region is per row and not per column, thus a row will be stored
as a whole on a region. If you have a really large row, the region size
granularity is dependent on it. It doesn't seem to be the case here.
Put/Delete creates a lock until finished. If you are intensive on writes
to the same row at the same time, this might be bad for you. Keeping
rows slimmer can reduce contention, but again, only if you make a lot of
concurrent modifications to the same row.

I expect batches of Put/Delete to the same row to happen by at most one
thread at a time based on user's current behavior. So locking shouldn't be
an issue. However, not sure if the saving row to a region with enough
topic is really an issue I need to worry about (probably because I just
don't know much about it yet).

Filtering - if you need a filter which needs all the row (there is a method
you override in Filter to mark that) then a fat row will be more memory
intensive. If you needed only 1/5 of your row, then maybe splitting it
into 5 rows to begin with would have made a better schema design in terms of
memory and I/O.

Currently, my access pattern is to get all data for a given row. It's
possible in the future we may want to apply (family/qualifier) filters.
There is a lot of uncertainty on use cases (client side) at this point
which is why I am not entirely sure on how things will look months from
now. I am not sure I follow this statement:

"if you need a filter which needs all the row (there is a method you
override in Filter to mark that) then a fat row will be more memory
intensive"

Can you please explain? Thank you for these suggestions btw, good food
for thought.

On Wednesday, July 3, 2013, Aji Janis wrote:

I have a major typo in the question so I apologize. I meant to say 5
families with 1000+ qualifiers each.

Let's work with an example (not the greatest example here, but still).
Say we have a Genre class like this:

import java.util.ArrayList;

class HistoryBooks {

    ArrayList<Books> author1;
    ArrayList<Books> author2;
    ArrayList<Books> author3;
    ArrayList<Books> author4;
    ArrayList<Books> author5;
}


Each author is a column family (let's say we only allow 5 authors per
<T>Book class). Book per author ends up being the qualifier. In this case, I
know I have a max family count but my qualifiers have no upper limit. Is
this scenario a case for tall or wide table? Why? Thank you.

On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
<bbeaudreault@hubspot.com> wrote:

If they are accessed mostly together they should all be a single
family. The key with tall or wide is based on the total byte size of
KeyValue. Your cells would need to be quite large for 50 to become a
problem. I still would recommend using a single CF though.
Sent from iPhone
