hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: One-table w/ multi-CF or multi-table w/ one-CF?
Date Tue, 09 Sep 2014 21:02:05 GMT
Locality? 

Then the data should be in the same column family.  That’s as local as you can get. 

I would suggest that you think of the following:

What’s the predominant use case? 
How are you querying the data? 
If you’re always hitting multiple CFs to get the data… then you should have it in the
same table. 

I think more people would benefit if they took more time thinking about their design and how
the data is being used and stored. 
It also helps to know that there really isn’t a single ‘right’ answer. Just a lot of wrong ones.
;-) 


Most people still try to think of HBase in terms of relational modeling and not in terms of
records and more of a hierarchical system. 
Things like CFs and Versioning are often misused because people see them as shortcuts. 

Also people tend not to think of their data in HBase in terms of 3D but in terms of 2D. 
(CF’s would be 2+D) 

The one question which really hasn’t been answered is how fat is fat in terms of a row’s
width and when is it too fat? 
This may seem like a simple thing, but it can impact a couple of things in your design. (I
never got a good answer, and it’s one of those questions that if your wife were to ask if the
pants she’s wearing make her fat, it’s time to run for the hills because you can’t win
no matter how you answer!) 
Seriously though, the optimal width of the column is not that easy to answer and sometimes
you have to just guess as to which would be a better design. 

One of the problems with CFs is that if there’s an imbalance in terms of the size of data
being stored in each CF, you can run into issues. 
CFs are stored in separate files and split when the base CF splits. (Assuming you have a base
CF and then multiple CFs that are related but store smaller records per row.) 
And then there’s the issue that each CF is stored separately. (If memory serves it’s
a separate file per CF, but right now my last living brain cell decided to call it quits and
went on strike for more beer.) 
[Damn you last brain cell!!!] :-) 

Again the idea is to follow KISS. 

HTH

-Mike

On Sep 8, 2014, at 7:17 AM, Jianshi Huang <jianshi.huang@gmail.com> wrote:

> Locality is important, that's why I chose CF to put related data into one
> group. I can surely put the CF part at the head of the rowkey to achieve a
> similar result, but since the number of types is fixed, I don't see any benefit
> in doing that.
> 
> With the setLoadColumnFamiliesOnDemand I learned from Ted, looks like the
> performance should be similar.
> 
> Am I missing something? Please enlighten me.
> 
> Jianshi
> 
> On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel <michael_segel@hotmail.com>
> wrote:
> 
>> I would suggest rethinking column families and look at your potential for
>> a slightly different row key.
>> 
>> Going with column families doesn’t really make sense.
>> 
>> Also how wide are the rows? (worst case?)
>> 
>> one idea is to make type part of the RK…
>> 
>> HTH
>> 
>> -Mike
>> 
>> On Sep 7, 2014, at 2:40 AM, Jianshi Huang <jianshi.huang@gmail.com> wrote:
>> 
>>> Hi Michael,
>>> 
>>> Thanks for the questions.
>>> 
>>> I'm modeling dynamic Graphs in HBase, all elements (vertices, edges) have a
>>> timestamp and I can query things like events between A and B for the last 7
>>> days.
>>>
>>> CFs are used for grouping different types of data for the same account.
>>> However, I have lots of skew in the data, so to avoid having too much in the
>>> same row, I had to move what was in CQs into RKs. So a CF now acts more like
>>> a table.
>>>
>>> There's one CF containing a sequence of events ordered by timestamp, and this
>>> CF is quite different as the use case is mostly in mapreduce jobs.
>>> 
>>> Jianshi
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>> 
>>>> Again, a silly question.
>>>> 
>>>> Why are you using column families?
>>>> 
>>>> Just to play devil’s advocate in terms of design, why are you not treating
>>>> your row as a record? Think hierarchical, not relational.
>>>>
>>>> This really gets into some design theory.
>>>>
>>>> Think of a Column Family as a way to group data that has the same row key and
>>>> references the same thing, yet the data in each column family is used
>>>> separately.
>>>> The example I always turn to when teaching is an order entry
>>>> system at a retailer.
>>>>
>>>> You generate data which is segmented by business process (order entry,
>>>> pick slips, shipping, invoicing). All reflect a single order, yet the data
>>>> in each process tends to be accessed separately.
>>>> (You don’t need the order entry when using the pick slip to pull orders
>>>> from the warehouse.)  So here, the data access pattern is that each column
>>>> family is used separately, except in generating the data (the order entry
>>>> is used to generate the pick slip(s) and set up things like backorders, and
>>>> then the pick process generates the shipping slip(s), etc.). And since they
>>>> are all focused on the same order, they have the same row key.
>>>>
>>>> So it’s reasonable to ask how you are accessing the data and how you are
>>>> designing your HBase model?
>>>>
>>>> Many times, developers create a model using column families because the
>>>> developer is thinking in terms of relationships, not access patterns on the
>>>> data.
>>>> 
>>>> Does this make sense?
>>>> 
>>>> 
>>>> On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.huang@gmail.com> wrote:
>>>> 
>>>>> BTW, a little explanation about the binning I mentioned.
>>>>> 
>>>>> Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
>>>>> 
>>>>> And with binning, it looks like
>>>>> <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could be
>>>>> id % 256 or timestamp % 256. And the table could be pre-split. So future
>>>>> ingestions could do parallel insertion to #<bin> regions, even without
>>>>> pre-split.
>>>>> 
>>>>> 
>>>>> Jianshi
>>>>> 
>>>>> 
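A minimal sketch of the binned rowkey layout described in the message above, written in Python for brevity. It assumes a numeric id, a millisecond timestamp reversed against Java's Long.MAX_VALUE, and zero-padded fields so that byte order matches numeric order. The function names and padding widths are illustrative assumptions and do not come from the thread.

```python
MAX_TS = 2**63 - 1  # assumption: reverse timestamps against Java's Long.MAX_VALUE
NUM_BINS = 256      # from the thread: bin_number could be id % 256


def rev_timestamp(ts_millis: int) -> int:
    """Reverse a timestamp so that newer events sort first."""
    return MAX_TS - ts_millis


def binned_rowkey(event_type: str, ts_millis: int, event_id: int,
                  num_bins: int = NUM_BINS) -> bytes:
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id> as bytes.

    The bin is derived from the id here; timestamp % num_bins is the
    other option mentioned in the thread.
    """
    bin_number = event_id % num_bins
    # Fixed-width, zero-padded fields keep lexicographic order equal to
    # numeric order (3 digits cover up to 1000 bins, 19 digits cover MAX_TS).
    key = f"{bin_number:03d}#{event_type}#{rev_timestamp(ts_millis):019d}#{event_id}"
    return key.encode("utf-8")
```

Because the bin comes first, writes for consecutive timestamps spread over up to 256 key prefixes instead of one monotonic range, and a table pre-split on the prefixes "001", "002", and so on would get one region per bin. The cost is that a time-range read has to visit every bin.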
>>>>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.huang@gmail.com> wrote:
>>>>>
>>>>>> Each range might span multiple regions, depending on the data size I want
>>>>>> to scan for MR jobs.
>>>>>>
>>>>>> The ranges are dynamic, specified by the user, but the number of bins can
>>>>>> be static (when the table/schema is created).
>>>>>> 
>>>>>> Jianshi
>>>>>> 
>>>>>> 
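With a binned key, one user-specified time window turns into one row range per bin, which is why multiple scan ranges come up in this thread. A rough sketch of how those (start, stop) pairs could be computed, assuming a key layout of <bin_number>#<type_of_events>#<rev_timestamp>#<id> with a 3-digit zero-padded bin and a 19-digit reversed timestamp. The padding widths and the function name are assumptions, not from the thread, and handing the ranges to an input format is left out.

```python
MAX_TS = 2**63 - 1  # assumption: timestamps are reversed against Java's Long.MAX_VALUE


def bin_scan_ranges(event_type, start_millis, end_millis, num_bins=256):
    """Return one (start_row, stop_row) pair per bin for start_millis < ts <= end_millis.

    start_row is inclusive and stop_row is exclusive, matching HBase scan semantics.
    """
    if not 0 <= start_millis < end_millis:
        raise ValueError("need 0 <= start_millis < end_millis")
    # Reversal flips the order: the newest timestamp has the smallest reversed value.
    low = MAX_TS - end_millis
    high = MAX_TS - start_millis
    ranges = []
    for bin_number in range(num_bins):
        prefix = f"{bin_number:03d}#{event_type}#"
        start_row = f"{prefix}{low:019d}#".encode("utf-8")
        stop_row = f"{prefix}{high:019d}#".encode("utf-8")
        ranges.append((start_row, stop_row))
    return ranges
```

The number of ranges equals the number of bins, so 16 to 256 bins gives 16 to 256 ranges per query regardless of how wide the time window is.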
>>>>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>
>>>>>>> bq. 16 to 256 ranges
>>>>>>>
>>>>>>> Would each range be within single region or the range may span regions ?
>>>>>>> Are the ranges dynamic ?
>>>>>>>
>>>>>>> Using command line for multiple ranges would be out of question. A file
>>>>>>> with ranges is needed.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Ted for the reference.
>>>>>>>>
>>>>>>>> That's right, extend the row.start and row.end to specify multiple ranges
>>>>>>>> and also getSplits.
>>>>>>>>
>>>>>>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
>>>>>>>> 256 ranges.
>>>>>>>> 
>>>>>>>> Jianshi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load and
>>>>>>>>> return full row
>>>>>>>>>
>>>>>>>>> bq. to extend TableInputFormat to accept multiple row ranges
>>>>>>>>>
>>>>>>>>> You mean extending hbase.mapreduce.scan.row.start and
>>>>>>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
>>>>>>>>> How many such ranges do you normally need ?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Ted,
>>>>>>>>>>
>>>>>>>>>> I'll pre-split the table during ingestion. The reason to keep the rowkey
>>>>>>>>>> monotonic is for easier working with TableInputFormat, otherwise I would've
>>>>>>>>>> binned it into 256 splits. (well, I think a good way is to extend
>>>>>>>>>> TableInputFormat to accept multiple row ranges, if there's an existing
>>>>>>>>>> efficient implementation, please let me know :)
>>>>>>>>>>
>>>>>>>>>> Would you elaborate a little more on the heap memory usage during scan? Is
>>>>>>>>>> there any reference to that?
>>>>>>>>>> 
>>>>>>>>>> Jianshi
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you use monotonically increasing rowkeys, separating out the column
>>>>>>>>>>> family into a new table would give you the same issue you're facing today.
>>>>>>>>>>>
>>>>>>>>>>> Using a single table, the essential column family feature would reduce the
>>>>>>>>>>> amount of heap memory used during scan. With two tables, there is no such
>>>>>>>>>>> facility.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the performance
>>>>>>>>>>>> I care most about is scan performance.
>>>>>>>>>>>>
>>>>>>>>>>>> It's mostly for analytics, so I don't care much about atomicity currently.
>>>>>>>>>>>>
>>>>>>>>>>>> What's your suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> Jianshi
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is this the same table you mentioned in the thread about
>>>>>>>>>>>>> RegionTooBusyException ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you move the column family to another table, you may have to handle
>>>>>>>>>>>>> atomicity yourself - currently atomic operations are within region
>>>>>>>>>>>>> boundaries.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm currently putting everything into one table (to make cross reference
>>>>>>>>>>>>>> queries easier) and there's one CF whose rowkeys are very different from
>>>>>>>>>>>>>> the rest. Currently it works well, but I'm wondering if it will cause
>>>>>>>>>>>>>> performance issues in the future.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So my questions are
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) will there be performance penalties in the way I'm doing it?
>>>>>>>>>>>>>> 2) should I move that CF to a separate table?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jianshi Huang
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> LinkedIn: jianshi
>>>>>>>>>>>>>> Twitter: @jshuang
>>>>>>>>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>>>>>>>> 

