hbase-user mailing list archives

From Anoop John <anoop.hb...@gmail.com>
Subject Re: HBase region assignment by range?
Date Wed, 08 Apr 2015 18:50:28 GMT
bq. while the regions can surely split as more data is added, can HBase keep
the new regions on the same regionServer according to the predefined
boundary?

You need a custom LB (load balancer) for that. With one in place, it is possible to restrict where the regions go.
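
Roughly what that could look like is sketched below, assuming 1.0-era APIs.
The hbase.master.loadbalancer.class property is the real hook; the class
name, boundary table, and slot logic are only illustrative (a production
balancer would also need to override balanceCluster() so periodic balancing
does not undo the pinning).

// hbase-site.xml on the HMaster:
//   hbase.master.loadbalancer.class = com.example.RangePinningBalancer

import java.util.List;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.master.balancer.SimpleLoadBalancer;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical balancer that pins each region to a server based on its start key. */
public class RangePinningBalancer extends SimpleLoadBalancer {

  // Pre-defined range boundaries; these mirror the presplit points.
  private static final byte[][] BOUNDS = {
      Bytes.toBytes("row10000"), Bytes.toBytes("row20000"), Bytes.toBytes("row30000") };

  private ServerName pin(byte[] startKey, List<ServerName> servers) {
    int slot = 0;
    // Find which pre-defined range the region's start key falls into.
    while (slot < BOUNDS.length && Bytes.compareTo(startKey, BOUNDS[slot]) >= 0) {
      slot++;
    }
    return servers.get(slot % servers.size());
  }

  @Override
  public ServerName randomAssignment(HRegionInfo region, List<ServerName> servers) {
    // Called when the master needs to pick a server for a region with no existing plan.
    return servers.isEmpty() ? null : pin(region.getStartKey(), servers);
  }
}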

-Anoop-


On Thu, Apr 9, 2015 at 12:09 AM, Demai Ni <nidmgg@gmail.com> wrote:

> hi, Guys,
>
> many thanks for your quick response.
>
> First, let me share what I am looking at, which may help to clarify the
> intention and answer a few of the questions. I am working on a POC to bring
> an MPP style of OLAP to Hadoop, and am looking at whether it is feasible to
> use HBase as the datastore. With HBase, I'd like to take advantage of 1) the
> OLTP capability; 2) the many filters; 3) in-cluster replicas and
> between-cluster replication. I am currently using the TPCH schema for this
> POC, and am also considering a star schema. Since it is a POC, I can pretty
> much define my own rules and set limitations as it fits. :-)
>
> Why doesn't this (presplit) work for you?
>
>  The reason is that presplit won't guarantee that the regions stay on the
> pre-assigned regionServer. Let's say I have a very large table and a very
> small table with different data distributions, even with the same presplit
> values. HBase won't ensure that the same range of data is located on the
> same physical node, unless we have the custom LB mentioned by @Anoop and
> @Esteban. Is my understanding correct? BTW, I will look into HBASE-10576 to
> see whether it fits my needs.
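>
> A quick way to sanity-check this is to dump the region-to-server mapping,
> e.g. with a sketch like the one below against the 1.0-era client API (the
> table name 'lineitem' is just my TPCH example):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HRegionLocation;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.RegionLocator;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ShowRegionPlacement {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>          RegionLocator locator = conn.getRegionLocator(TableName.valueOf("lineitem"))) {
>       // Print each region's [startKey, endKey) range and its current server.
>       for (HRegionLocation loc : locator.getAllRegionLocations()) {
>         System.out.println(Bytes.toStringBinary(loc.getRegionInfo().getStartKey())
>             + " .. " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey())
>             + " -> " + loc.getServerName());
>       }
>     }
>   }
> }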
>
> Is your table static?
> >
> While I can make it static for POC purposes, I won't rely on that
> limitation, as I'd like HBase for its OLTP feature. So besides the 'static'
> HFiles, I'd need the HLogs on the same local node too. But again, I will
> only worry about the 'static' HFiles for now.
>
> However as you add data to the table, those regions will eventually split.
>
>  While the regions can surely split as more data is added, can HBase keep
> the new regions on the same regionServer according to the predefined
> boundary? I will worry about the hotspot issue later; that is the beauty of
> doing a POC instead of production. :-)
>
> What you’re suggesting is that as you do a region scan, you’re going to the
> > other table and then try to fetch a row if it exists.
> >
> Yes, something like that. I am currently using the client API: scan() with a
> start and end key. Since I know my start and end keys, and with the
> local-read feature, the scan should be a local read. With some statistics
> (such as which table is larger) and a hash join operation (which I need to
> implement), the join should work with not-too-bad performance. Again, it is
> a POC, so I won't worry about the situation where a regionServer hosts too
> much data (hotspot). But surely, an LB should be used before putting this
> into production if that ever occurs.
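>
> Roughly like the sketch below (the table names 'small_table'/'big_table'
> are placeholders, and projections/error handling are left out):
>
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class LocalHashJoin {
>   /** Joins rows of big_table in [startKey, stopKey) with small_table on equal rowkeys. */
>   static void hashJoin(Connection conn, byte[] startKey, byte[] stopKey) throws Exception {
>     // Build side: load the small table into memory, keyed by rowkey.
>     Map<String, Result> buildSide = new HashMap<>();
>     try (Table small = conn.getTable(TableName.valueOf("small_table"));
>          ResultScanner rs = small.getScanner(new Scan())) {
>       for (Result r : rs) {
>         buildSide.put(Bytes.toString(r.getRow()), r);
>       }
>     }
>     // Probe side: range-restricted scan of the large table, using the known
>     // start/end keys so it touches only the regions in that range.
>     try (Table big = conn.getTable(TableName.valueOf("big_table"));
>          ResultScanner rs = big.getScanner(new Scan(startKey, stopKey))) {
>       for (Result r : rs) {
>         Result match = buildSide.get(Bytes.toString(r.getRow()));
>         if (match != null) {
>           // Emit the joined row: columns from r plus columns from match.
>         }
>       }
>     }
>   }
> }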
>
> either the second table should be part of the first table in the same CF or
> > as a separate CF
> >
> I am not sure whether that will work for the situation of a large table vs
> a small table. The data of the small table would have to be duplicated in
> many places, and an update of the small table could be costly.
>
> Demai
>
>
> On Wed, Apr 8, 2015 at 10:24 AM, Esteban Gutierrez <esteban@cloudera.com>
> wrote:
>
> > +1 Anoop.
> >
> > That's pretty much the only way right now if you need custom balancing.
> > This balancer doesn't have to live in the HMaster and can be invoked
> > externally (there are caveats to doing that, e.g. when a RS dies, but it
> > has worked ok so far). A long-term solution for the problem you are trying
> > to solve is HBASE-10576, with a little tweaking.
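> >
> > For reference, the external version can be a sketch like the one below
> > (1.0-era Admin API; the class and range handling are only illustrative,
> > and you would need to re-run it after a RS dies):
> >
> > import org.apache.hadoop.hbase.HRegionLocation;
> > import org.apache.hadoop.hbase.ServerName;
> > import org.apache.hadoop.hbase.TableName;
> > import org.apache.hadoop.hbase.client.Admin;
> > import org.apache.hadoop.hbase.client.Connection;
> > import org.apache.hadoop.hbase.client.RegionLocator;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class ExternalRangePinner {
> >   /** Moves every region whose start key falls in [rangeStart, rangeEnd) to target. */
> >   static void pinRange(Connection conn, TableName table, byte[] rangeStart,
> >       byte[] rangeEnd, ServerName target) throws Exception {
> >     try (Admin admin = conn.getAdmin();
> >          RegionLocator locator = conn.getRegionLocator(table)) {
> >       // Keep the stock balancer from undoing the manual placement.
> >       admin.setBalancerRunning(false, true);
> >       for (HRegionLocation loc : locator.getAllRegionLocations()) {
> >         byte[] start = loc.getRegionInfo().getStartKey();
> >         if (Bytes.compareTo(start, rangeStart) >= 0
> >             && Bytes.compareTo(start, rangeEnd) < 0
> >             && !target.equals(loc.getServerName())) {
> >           admin.move(loc.getRegionInfo().getEncodedNameAsBytes(),
> >               Bytes.toBytes(target.getServerName()));
> >         }
> >       }
> >     }
> >   }
> > }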
> >
> > cheers,
> > esteban.
> >
> >
> >
> >
> >
> > --
> > Cloudera, Inc.
> >
> >
> > On Wed, Apr 8, 2015 at 4:41 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> >
> >
> > > Is your table static?
> > >
> > > If you know your data and your ranges, you can do it. However, as you add
> > > data to the table, those regions will eventually split.
> > >
> > > The other issue that you brought up is that you want to do ‘local’ joins.
> > >
> > > Simple single word response… don’t.
> > >
> > > Longer response..
> > >
> > > You’re suggesting that the tables in question share the row key in
> > > common.  Ok… why? Are they part of the same record?
> > > How is the data normally being used?
> > >
> > > Have you looked at column families?
> > >
> > > The issue is that joins are expensive. What you’re suggesting is that as
> > > you do a region scan, you’re going to the other table and then trying to
> > > fetch a row if it exists.
> > > So it’s essentially, for each row in the scan, try a get(), which will
> > > almost double the cost of your fetch. Then you have to decide how to do
> > > it locally. Are you really going to write a coprocessor for this? (Hint:
> > > if this is a common thing, then either the second table should be part of
> > > the first table in the same CF or as a separate CF. You need to rethink
> > > your schema.)
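> > >
> > > For example, something like the sketch below (table and family names are
> > > made up): one table, one rowkey, one column family per logical table, so
> > > a single scan or get() returns both sides with no join at all.
> > >
> > > import org.apache.hadoop.hbase.HColumnDescriptor;
> > > import org.apache.hadoop.hbase.HTableDescriptor;
> > > import org.apache.hadoop.hbase.TableName;
> > > import org.apache.hadoop.hbase.client.Admin;
> > > import org.apache.hadoop.hbase.client.Connection;
> > > import org.apache.hadoop.hbase.client.Put;
> > > import org.apache.hadoop.hbase.client.Table;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > >
> > > public class MergedSchema {
> > >   static void createAndWrite(Connection conn) throws Exception {
> > >     // CF "a" holds the first table's columns, CF "b" the second's.
> > >     HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("merged"));
> > >     desc.addFamily(new HColumnDescriptor("a"));
> > >     desc.addFamily(new HColumnDescriptor("b"));
> > >     try (Admin admin = conn.getAdmin()) {
> > >       admin.createTable(desc);
> > >     }
> > >     try (Table t = conn.getTable(TableName.valueOf("merged"))) {
> > >       Put put = new Put(Bytes.toBytes("row10001"));
> > >       put.addColumn(Bytes.toBytes("a"), Bytes.toBytes("col1"), Bytes.toBytes("v1"));
> > >       put.addColumn(Bytes.toBytes("b"), Bytes.toBytes("col1"), Bytes.toBytes("v2"));
> > >       t.put(put);
> > >     }
> > >   }
> > > }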
> > >
> > > Does this make sense?
> > >
> > > > On Apr 7, 2015, at 7:05 PM, Demai Ni <nidmgg@gmail.com> wrote:
> > > >
> > > > hi, folks,
> > > >
> > > > I have a question about region assignment and would like to clarify
> > > > some thoughts.
> > > >
> > > > Let's say I have a table with rowkeys "row00000 ~ row30000" on a 4-node
> > > > HBase cluster. Is there a way to keep the data partitioned by range on
> > > > each node? For example:
> > > >
> > > > node1:  <=row10000
> > > > node2:  row10001~row20000
> > > > node3:  row20001~row30000
> > > > node4:  >row30000
> > > >
> > > > And even when one of the nodes becomes a hotspot, the boundary won't be
> > > > crossed unless load balancing is done manually?
> > > >
> > > > I looked at presplit: { SPLITS => ['row100','row200','row300'] }, but I
> > > > don't think it serves this purpose.
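> > > >
> > > > (For completeness, the same presplit can also be done from the Java
> > > > client; a sketch with made-up table/CF names, using the boundaries from
> > > > the node ranges above:)
> > > >
> > > > import org.apache.hadoop.hbase.HColumnDescriptor;
> > > > import org.apache.hadoop.hbase.HTableDescriptor;
> > > > import org.apache.hadoop.hbase.TableName;
> > > > import org.apache.hadoop.hbase.client.Admin;
> > > > import org.apache.hadoop.hbase.util.Bytes;
> > > >
> > > > public class PresplitTable {
> > > >   static void create(Admin admin) throws Exception {
> > > >     // Three split points give four regions, one per node range.
> > > >     byte[][] splits = {
> > > >         Bytes.toBytes("row10000"), Bytes.toBytes("row20000"), Bytes.toBytes("row30000") };
> > > >     HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
> > > >     desc.addFamily(new HColumnDescriptor("cf"));
> > > >     admin.createTable(desc, splits);
> > > >   }
> > > > }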
> > > >
> > > > BTW, a bit of background: I am thinking of doing a local join between
> > > > two tables if both have the same rowkey and are partitioned by range
> > > > (or by the same hash algorithm). If I can keep the join key on the same
> > > > node (aka regionServer), the join can be handled locally instead of
> > > > being broadcast to all the other nodes.
> > > >
> > > > Thanks for your input. A couple of pointers to blogs/presentations
> > > > would be appreciated.
> > > >
> > > > Demai
> > >
> > > The opinions expressed here are mine, while they may reflect a cognitive
> > > thought, that is purely accidental.
> > > Use at your own risk.
> > > Michael Segel
> > > michael_segel (AT) hotmail.com
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
