hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: HBase region assignment by range?
Date Wed, 08 Apr 2015 11:41:26 GMT
Is your table static? 

If you know your data and your ranges, you can do it. However, as you add data to the table,
those regions will eventually split, so the boundaries you started with won’t stay fixed. 
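
To make the point about splits concrete, here is a minimal sketch in plain Python. It is not the HBase client API: `region_index` and `split_region` are hypothetical helpers, and the split keys are assumptions picked to match the ranges in the question quoted below. It only models which key range a row falls into; which region server actually hosts a region is decided by the cluster and is not modelled here.

```python
from bisect import bisect_right, insort

# Hypothetical split keys matching the ranges in the question below.
# n split keys give n + 1 regions.
split_keys = ["row10001", "row20001", "row30001"]

def region_index(row_key, keys):
    """Index of the region holding row_key.

    Region i covers [keys[i-1], keys[i]): start key inclusive, end key
    exclusive. Plain string comparison stands in for HBase's byte-wise
    ordering, which agrees for fixed-width ASCII keys like these.
    """
    return bisect_right(keys, row_key)

def split_region(keys, split_point):
    """Model a region split by adding one more boundary (keeps keys sorted)."""
    if split_point not in keys:
        insort(keys, split_point)

before = region_index("row25000", split_keys)   # third of four regions
split_region(split_keys, "row05000")            # first region grows and splits
after = region_index("row25000", split_keys)    # same key, now fourth of five regions
print(before, after, len(split_keys) + 1)
```

The row key itself never moved, but once any region splits, the "one range per node" picture no longer matches the layout you started with.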

The other issue that you brought up is that you want to do ‘local’ joins.

Simple single word response… don’t. 

Longer response.. 

You’re suggesting that the tables in question share the row key in common.  Ok… why? Are
they part of the same record? 
How is the data normally being used?  

Have you looked at column families?

The issue is that joins are expensive. What you’re suggesting is that as you do a region
scan, you go to the other table and try to fetch the matching row, if it exists. 
So it’s essentially a get() for each row in the scan, which will almost double the cost
of your fetch. Then you have to decide how to do it locally. Are you really going to write
a coprocessor for this?  (Hint: if this is a common thing, then the second table’s data should
live in the first table, either in the same CF or in a separate CF. You need to rethink your schema.)
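
A rough way to see the cost difference is to count reads. The sketch below is plain Python with in-memory dictionaries, not HBase code: `CountingTable`, `join_two_tables` and `scan_both_families` are hypothetical names, and the column names (`a:x`, `b:y`) are made up for illustration. It assumes every read costs the same, which is a simplification.

```python
class CountingTable:
    """In-memory stand-in for a table that counts read operations."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.scanned_rows = 0
        self.gets = 0

    def scan(self):
        # Yield rows in sorted key order, counting each row returned.
        for key in sorted(self.rows):
            self.scanned_rows += 1
            yield key, self.rows[key]

    def get(self, key):
        # Point lookup; returns None when the row does not exist.
        self.gets += 1
        return self.rows.get(key)


def join_two_tables(left, right):
    """Scan `left` and issue one get() against `right` per scanned row."""
    out = []
    for key, left_cols in left.scan():
        right_cols = right.get(key)
        if right_cols is not None:
            out.append((key, {**left_cols, **right_cols}))
    return out


def scan_both_families(table, families=("a", "b")):
    """Single scan; keep rows that have a column in every listed family."""
    out = []
    for key, cols in table.scan():
        if all(any(c.startswith(f + ":") for c in cols) for f in families):
            out.append((key, cols))
    return out


# Two tables sharing a row key: one scan plus one get per scanned row.
left = CountingTable({f"row{i}": {"a:x": i} for i in range(1, 5)})
right = CountingTable({"row1": {"b:y": 10}, "row3": {"b:y": 30}})
joined = join_two_tables(left, right)
join_reads = left.scanned_rows + right.gets

# Same data in one table with two column families: one scan, no gets.
merged = CountingTable({
    "row1": {"a:x": 1, "b:y": 10},
    "row2": {"a:x": 2},
    "row3": {"a:x": 3, "b:y": 30},
    "row4": {"a:x": 4},
})
same_rows = scan_both_families(merged)
merged_reads = merged.scanned_rows + merged.gets

print(join_reads, merged_reads, joined == same_rows)
```

Both approaches return the same joined rows, but the two-table version pays for a get() even on rows that have no match in the second table.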

Does this make sense? 

> On Apr 7, 2015, at 7:05 PM, Demai Ni <nidmgg@gmail.com> wrote:
> hi, folks,
> I have a question about region assignment and would like to clarify some thoughts.
> Let's say I have a table with rowkey as "row00000 ~ row30000" on a 4 node
> hbase cluster, is there a way to keep data partitioned by range on each
> node? for example:
> node1:  <=row10000
> node2:  row10001~row20000
> node3:  row20001~row30000
> node4:  >row30000
> And even when one of the nodes becomes a hotspot, the boundary won't be crossed
> unless manually doing a load balancing?
> I looked at presplit: { SPLITS => ['row100','row200','row300'] } , but
> don't think it serves this purpose.
> BTW, a bit background. I am thinking to do a local join between two tables
> if both have same rowkey, and partitioned by range (or same hash
> algorithm). If I can keep the join-key on the same node(aka regionServer),
> the join can be handled locally instead of broadcast to all other nodes.
> Thanks for your input. A couple pointers to blog/presentation would be
> appreciated.
> Demai

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
