hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: When to expand vertically vs. horizontally in Hbase
Date Fri, 05 Jul 2013 16:07:04 GMT
Why do you have so many column families (CFs)?

It's not a question of physical limitations, but more an issue of data design.

There aren't many really good examples of a design that requires more than a
handful of CFs.

When I teach or lecture, the example I use is an order entry system, where you would have
the same key on order entry, pick slips, shipping, and invoice.

That's probably the best example of where CFs come into play.

I'd suggest that you go back and rethink the design if you have more than a handful.
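
A minimal sketch of the order entry example, written as a plain-Java model of the layout rather than against the HBase client API. The class name, the four family names (order, pick, ship, invoice), and the row key format are assumptions made for illustration. The point it shows is one row key shared across a small, fixed set of families, each of which can be read on its own.

```java
import java.util.Map;
import java.util.TreeMap;

// Models one table as: row key -> column family -> qualifier -> value.
// This is an in-memory illustration only; it does not talk to HBase.
class OrderTableSketch {
    // Hypothetical families, fixed up front, one per stage of an order.
    static final String[] FAMILIES = {"order", "pick", "ship", "invoice"};

    private final Map<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    static boolean isKnownFamily(String family) {
        for (String f : FAMILIES) {
            if (f.equals(family)) {
                return true;
            }
        }
        return false;
    }

    // Writes one cell; rejects families that were not declared up front.
    void put(String rowKey, String family, String qualifier, String value) {
        if (!isKnownFamily(family)) {
            throw new IllegalArgumentException("unknown family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Reads a single family of a row; returns an empty map if nothing is stored.
    Map<String, String> getFamily(String rowKey, String family) {
        Map<String, Map<String, String>> row = rows.get(rowKey);
        if (row == null) {
            return new TreeMap<>();
        }
        Map<String, String> cells = row.get(family);
        if (cells == null) {
            return new TreeMap<>();
        }
        return cells;
    }

    public static void main(String[] args) {
        OrderTableSketch table = new OrderTableSketch();
        String key = "order-1001"; // hypothetical row key shared by every stage
        table.put(key, "order", "customer", "acme");
        table.put(key, "pick", "bin", "A-17");
        table.put(key, "ship", "carrier", "ups");
        // Shipping can read its own family without touching the others.
        System.out.println(table.getFamily(key, "ship"));
    }
}
```

In this sketch the families map to stages that are written and read at different times under the same key, which is the access pattern the order entry example relies on.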

On Jul 5, 2013, at 8:53 AM, Aji Janis <aji1705@gmail.com> wrote:

> Asaf,
> I am using the Genre/Author stuff as an example but yes at the moment I
> only have 5 column families. However, over time I may have more (no upper
> limit decided at this point). See below for more responses.
> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:
>> Do you have only 5 static author names?
>> Keep in mind the column family name is defined when creating the table.
>> Regarding tall vs wide debate:
>> HBase is first and foremost a key-value database, so it reads and writes at
>> the column-value level. So it doesn't really care about rows.
>> But it's not entirely true. Rows come into play in the following
>> situations:
>> Splitting a region is per row and not per column, thus a row will be saved
>> as a whole on a region. If you have a really large row, the region size
>> granularity is dependent on it. It doesn't seem to be the case here.
>> Put/Delete creates a lock until finished. If you are intensive on inserts
>> to the same row at the same time, this might be bad for you. Keeping your
>> rows slimmer can reduce contention, but again, only if you make a lot of
>> concurrent modifications to the same row.
> I expect batches of Put/Delete to the same row to happen by at most one
> thread at a time based on user's current behavior. So locking shouldn't be
> an issue. However, I'm not sure whether a row having to be stored whole on
> one region is really an issue I need to worry about (probably because I just
> don't know much about it yet).
>> Filtering - if you need a filter which needs the whole row (there is a method
>> you override in Filter to mark that), then a fat row will be more memory
>> intensive. If you needed only 1/5 of your row, then maybe splitting it into 5
>> rows to begin with would have made a better schema design in terms of
>> memory and I/O.
> Currently, my access pattern is to get all data for a given row. It's
> possible in the future we may want to apply (family/qualifier) filters.
> There is a lot of uncertainty on use cases (client side) at this point,
> which is why I am not entirely sure how things will look months from
> now. I am not sure I follow this statement:
> "if you need a filter which needs the whole row (there is a method you
> override in Filter to mark that), then a fat row will be more memory
> intensive."
> Can you please explain? Thank you for these suggestions btw, good food for
> thought!
>> On Wednesday, July 3, 2013, Aji Janis wrote:
>>> I have a major typo in the question so I apologize. I meant to say 5
>>> families with 1000+ qualifiers each.
>>> Let's work with an example (not the greatest example here, but still). Let's
>>> say we have a Genre class like this:
>>> class HistoryBooks {
>>>     ArrayList<Books> author1;
>>>     ArrayList<Books> author2;
>>>     ArrayList<Books> author3;
>>>     ArrayList<Books> author4;
>>>     ArrayList<Books> author5;
>>>     // ...
>>> }
>>> Each author is a column family (let's say we only allow 5 authors per
>>> <T>Book class). Book per author ends up being the qualifier. In this case, I
>>> know I have a max family count but my qualifiers have no upper limit. So is
>>> this scenario a case for a tall or a wide table? Why? Thank you.
>>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
>>> <bbeaudreault@hubspot.com> wrote:
>>>> If they are accessed mostly together they should all be a single column
>>>> family. The key with tall or wide is based on the total byte size of each
>>>> KeyValue. Your cells would need to be quite large for 50 to become a
>>>> problem. I still would recommend using a single CF though.
>>>> —
>>>> Sent from iPhone
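
One way to picture the tall vs. wide question discussed in this thread, as a rough sketch only: in the tall layout each book becomes its own row, with the genre and author folded into the row key, and "get everything for a genre" becomes a prefix scan over sorted keys. HBase keeps rows sorted by row key, which a TreeMap imitates here. The class name, the "|" separator (assumed never to appear in a key part), and the sample values are assumptions for illustration; nothing below uses the HBase client API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrates a "tall" layout: one row per (genre, author, book).
class TallKeySketch {
    // Hypothetical separator; assumed not to occur inside genre, author, or book.
    static final char SEP = '|';

    // Builds a composite row key such as "history|author1|bookA".
    static String tallKey(String genre, String author, String book) {
        return genre + SEP + author + SEP + book;
    }

    // Returns every row whose key starts with the given prefix.
    // Relies on the map being sorted by key, as HBase rows are.
    static SortedMap<String, String> prefixScan(TreeMap<String, String> table, String prefix) {
        // Upper bound: the prefix followed by the largest possible char (exclusive).
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(tallKey("history", "author1", "bookA"), "contents of bookA");
        table.put(tallKey("history", "author2", "bookB"), "contents of bookB");
        table.put(tallKey("scifi", "author1", "bookC"), "contents of bookC");

        // All history books, regardless of author.
        System.out.println(prefixScan(table, "history" + SEP).keySet());
        // Only author1's history books.
        System.out.println(prefixScan(table, "history" + SEP + "author1" + SEP).keySet());
    }
}
```

The wide layout described earlier in the thread would instead keep a single row per genre and put each book in a qualifier, so the same read is a single-row get, at the cost of a row that grows without an upper limit.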
