hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: One-table w/ multi-CF or multi-table w/ one-CF?
Date Sat, 06 Sep 2014 20:52:18 GMT
Again, a silly question. 

Why are you using column families? 

Just to play devil’s advocate in terms of design, why are you not treating your row as a
record? Think hierarchical, not relational.

This really gets into some design theory.

Think of a column family as a way to group data that shares the same row key and references
the same thing, yet where the data in each column family is used separately.
The example I always turn to when teaching is an order entry system at a retailer.

You generate data that is segmented by business process (order entry, pick slips, shipping,
invoicing). All of it reflects a single order, yet the data in each process tends to be
accessed separately.
(You don’t need the order entry when using the pick slip to pull orders from the warehouse.)

So here, the data access pattern is that each column family is used separately, except when
generating the data: the order entry is used to generate the pick slip(s) and set up things
like backorders, then the pick process generates the shipping slip(s), and so on. And since
they are all focused on the same order, they have the same row key.
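
To make the order example concrete, here is a minimal sketch in plain Python (no HBase client involved). It models one row as a nested dictionary of column family, then qualifier, then value. The family names, qualifiers, sample values, and both helper functions are made up for illustration; only the idea of one row key with separately accessed families comes from the description above.

```python
# Hypothetical in-memory model of one HBase-style row:
# row key -> column family -> qualifier -> value.
# All names and values below are illustrative assumptions.
order_table = {
    "order#1001": {
        "entry": {"customer": "C-17", "sku": "A-42", "qty": "3"},
        "ship": {"carrier": "ground"},
        "invoice": {"amount": "59.97"},
    }
}


def read_family(table, row_key, family):
    """Return a copy of the columns of a single family for one row."""
    return dict(table.get(row_key, {}).get(family, {}))


def generate_pick_slip(table, row_key, slip_id):
    """Derive the 'pick' family from the 'entry' family of the same row."""
    entry = read_family(table, row_key, "entry")
    table.setdefault(row_key, {})["pick"] = {
        "slip_id": slip_id,
        "sku": entry["sku"],
        "qty": entry["qty"],
    }


# Generating the data touches two families of the same row ...
generate_pick_slip(order_table, "order#1001", "P-9")
# ... while the warehouse only ever reads the 'pick' family afterwards.
pick = read_family(order_table, "order#1001", "pick")
```

The point of the sketch is the access pattern: the families meet only when one process generates the data for the next, and every later read touches a single family.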

So it’s reasonable to ask how you are accessing the data and how you are designing your HBase
model.

Many times, developers create a model using column families because they are thinking
in terms of relationships, not access patterns on the data.

Does this make sense? 

 
On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.huang@gmail.com> wrote:

> BTW, a little explanation about the binning I mentioned.
> 
> Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
> 
> And with binning, it looks like
> <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could be
> id % 256 or timestamp % 256, and the table could be pre-split. So future
> ingestions could insert in parallel into #<bin> regions, even without
> a pre-split.
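
A rough sketch of the binned row key described above, in plain Python with no HBase client. Several details are assumptions rather than anything stated in the thread: the bin prefix is zero-padded to three digits so that keys sort by bin, string ids are hashed with CRC32 as a stand-in for `id % 256`, and the function names are hypothetical. The second helper shows why a binned layout needs one scan range per bin for a single logical range.

```python
import zlib

NUM_BINS = 256  # fixed when the table/schema is created


def bin_for(event_id, num_bins=NUM_BINS):
    """Map an id to a bin; CRC32 stands in for 'id % 256' (assumption)."""
    return zlib.crc32(str(event_id).encode("utf-8")) % num_bins


def binned_rowkey(event_type, rev_timestamp, event_id, num_bins=NUM_BINS):
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id>."""
    b = bin_for(event_id, num_bins)
    # Zero-padding keeps lexicographic order equal to numeric bin order.
    return f"{b:03d}#{event_type}#{rev_timestamp}#{event_id}"


def scan_ranges(event_type, rev_ts_start, rev_ts_stop, num_bins=NUM_BINS):
    """Return one (start_key, stop_key) pair per bin for a logical range."""
    return [
        (f"{b:03d}#{event_type}#{rev_ts_start}",
         f"{b:03d}#{event_type}#{rev_ts_stop}")
        for b in range(num_bins)
    ]


key = binned_rowkey("click", "0000000123", "42")
ranges = scan_ranges("click", "0000000100", "0000000200", num_bins=16)
```

With 16 to 256 bins, one user-specified range expands into 16 to 256 key ranges, which is the multiple-range input discussed further down the thread.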
> 
> 
> Jianshi
> 
> 
> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
> 
>> Each range might span multiple regions, depending on the data size I want
>> to scan for MR jobs.
>> 
>> The ranges are dynamic, specified by the user, but the number of bins can
>> be static (when the table/schema is created).
>> 
>> Jianshi
>> 
>> 
>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> 
>>> bq. 16 to 256 ranges
>>> 
>>> Would each range be within a single region, or may a range span regions?
>>> Are the ranges dynamic?
>>> 
>>> Using the command line for multiple ranges would be out of the question. A file
>>> with ranges is needed.
>>> 
>>> Cheers
>>> 
>>> 
>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>> 
>>>> Thanks Ted for the reference.
>>>> 
>>>> That's right, extend the row.start and row.end to specify multiple ranges
>>>> and also getSplits.
>>>> 
>>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
>>>> 256 ranges.
>>>> 
>>>> Jianshi
>>>> 
>>>> 
>>>> 
>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> 
>>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load and
>>>>> return full row
>>>>> 
>>>>> bq. to extend TableInputFormat to accept multiple row ranges
>>>>> 
>>>>> You mean extending hbase.mapreduce.scan.row.start and
>>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
>>>>> How many such ranges do you normally need?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> 
>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Thanks Ted,
>>>>>> 
>>>>>> I'll pre-split the table during ingestion. The reason to keep the rowkey
>>>>>> monotonic is for easier working with TableInputFormat; otherwise I would've
>>>>>> binned it into 256 splits. (Well, I think a good way is to extend
>>>>>> TableInputFormat to accept multiple row ranges. If there's an existing
>>>>>> efficient implementation, please let me know :)
>>>>>> 
>>>>>> Would you elaborate a little more on the heap memory usage during scan? Is
>>>>>> there any reference to that?
>>>>>> 
>>>>>> Jianshi
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>> 
>>>>>>> If you use monotonically increasing rowkeys, separating out the column
>>>>>>> family into a new table would give you the same issue you're facing today.
>>>>>>> 
>>>>>>> Using a single table, the essential column family feature would reduce the
>>>>>>> amount of heap memory used during scan. With two tables, there is no such
>>>>>>> facility.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Ted,
>>>>>>>> 
>>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the performance
>>>>>>>> I care most about is scan performance.
>>>>>>>> 
>>>>>>>> It's mostly for analytics, so I don't care much about atomicity currently.
>>>>>>>> 
>>>>>>>> What's your suggestion?
>>>>>>>> 
>>>>>>>> Jianshi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Is this the same table you mentioned in the thread about
>>>>>>>>> RegionTooBusyException?
>>>>>>>>> 
>>>>>>>>> If you move the column family to another table, you may have to handle
>>>>>>>>> atomicity yourself - currently atomic operations are within region
>>>>>>>>> boundaries.
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I'm currently putting everything into one table (to make cross reference
>>>>>>>>>> queries easier) and there's one CF which contains rowkeys very different to
>>>>>>>>>> the rest. Currently it works well, but I'm wondering if it will cause
>>>>>>>>>> performance issues in the future.
>>>>>>>>>> 
>>>>>>>>>> So my questions are
>>>>>>>>>> 
>>>>>>>>>> 1) will there be performance penalties in the way I'm doing?
>>>>>>>>>> 2) should I move that CF to a separate table?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> --
>>>>>>>>>> Jianshi Huang
>>>>>>>>>> 
>>>>>>>>>> LinkedIn: jianshi
>>>>>>>>>> Twitter: @jshuang
>>>>>>>>>> Github & Blog: http://huangjs.github.com/

