hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: hbase doubts
Date Wed, 19 Aug 2015 16:21:22 GMT
Please read the following w.r.t. region splits:

http://hbase.apache.org/book.html#arch.region.splits (there is a link to a blog
post with details)
http://hbase.apache.org/book.html#manual_region_splitting_decisions

FYI
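
To make the concern in the quoted question concrete, here is a toy model in plain Python, not HBase code. As far as I understand, HBase splits a region at a row key near the middle of its largest store file rather than handing out files 5-5, and the daughter regions reference the parent's files until a later compaction rewrites them. The sketch below only models the key-range effect. The `simulate` function, the row-count threshold, and the key format are assumptions made up for illustration.

```python
from bisect import bisect_right

def simulate(num_rows, split_threshold):
    """Insert monotonically increasing keys into a toy table.

    A region is a sorted list of row keys. Once a region holds
    `split_threshold` rows (must be >= 2), it is split at its middle key.
    Real HBase splits on store file size, not on row count.
    """
    regions = [[]]
    for i in range(num_rows):
        key = f"2015-{i:08d}"  # fixed width, so string order equals insert order
        # Start key of every region except the first (which starts at -infinity).
        start_keys = [r[0] for r in regions[1:]]
        idx = bisect_right(start_keys, key)  # region whose range covers `key`
        region = regions[idx]
        region.append(key)  # keys only increase, so the list stays sorted
        if len(region) >= split_threshold:
            mid = len(region) // 2
            regions[idx:idx + 1] = [region[:mid], region[mid:]]
    return regions

sizes = [len(r) for r in simulate(1000, 100)]
print(sizes)
```

In this model every write lands in the last region, and each region that has been split off stops growing at half the threshold, which matches the "half filled" behavior asked about below.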

On Tue, Aug 18, 2015 at 9:53 PM, Shushant Arora <shushantarora09@gmail.com>
wrote:

> When the last region gets new data and splits in two, what is the split
> point? Say the last region had 10 files and the split algorithm decided to
> split this region.
>
> Will the two child regions get 5 files each? Or will the key space of the
> parent region, say (2015-08-01#guid to 2015-08-06#guid), be divided into two
> equal parts, so that child1 has (2015-08-01#guid to 2015-08-03#guid) and
> child2 has (2015-08-04#guid to 2015-08-06#guid), with all data rewritten
> into the child regions to match these key ranges? In that case, since the
> data is a time series, new data arrives with increasing dates only
> (dates > 2015-08-06), so it all goes to child2 and child1 always stays
> half filled. Only child2 would lead to new splits once it reaches the split
> size threshold.
>
> On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Since year and month are part of the row key in this scenario (instead of
> > just the day of month), the last region would get new data and be split.
> >
> > Is this effect desirable for your app ?
> >
> > Cheers
> >
> > On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora <shushantarora09@gmail.com> wrote:
> >
> > > For an HBase key containing time as a prefix, say (yyyy-mm-dd#other
> > > fields of guid base), I am using bulk load to avoid hotspotting a
> > > region server (and to avoid writes to the WAL).
> > >
> > > What should the initial region splits be? Say I have 30 region servers.
> > >
> > > Will it work to use the initial 30 days as the initial splits and let
> > > auto split take care of splitting regions as they grow further?
> > > Or, since the key has the date as a prefix, when a region is split in
> > > two at the midpoint and new data arrives for increasing dates only, will
> > > one region always stay half filled, with the other half never filled?
> > >
> > > On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <anilgupta84@gmail.com>
> > wrote:
> > >
> > > > In my experience, Phoenix is far superior to the Hive-HBase integration
> > > > for SQL-like querying on HBase. That is because Phoenix is built on top
> > > > of HBase, unlike Hive.
> > > >
> > > > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > >
> > > > > To my knowledge, Phoenix provides better integration with hbase.
> > > > >
> > > > > A third possibility is Spark on HBase.
> > > > >
> > > > > If you want to explore these alternatives, I suggest asking on the
> > > > > respective mailing lists, where you can get expert opinions.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Which one is better for SQL-like queries over HBase (queries that
> > > > > > involve filters, key range scans, and aggregates by column values)?
> > > > > >
> > > > > > 1. Hive storage handlers
> > > > > > 2. Phoenix
> > > > > >
> > > > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > >
> > > > > > > For #1, if you want to count distinct values for F1, you can write a
> > > > > > > coprocessor which aggregates the count on the region server and
> > > > > > > returns the result to the client, which does the final aggregation.
> > > > > > >
> > > > > > > Take a look at
> > > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
> > > > > > > and related classes for an example.
> > > > > > >
> > > > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks!
> > > > > > > > A few more doubts:
> > > > > > > >
> > > > > > > > 1. Say the requirement is to count the distinct values of F1.
> > > > > > > >
> > > > > > > > If the field is part of the key, can't HBase just scan the keys,
> > > > > > > > skip value deserialization, and return the result to the client,
> > > > > > > > which will calculate the distinct count? In the second approach,
> > > > > > > > HBase will deserialize the value of the column containing F1 and
> > > > > > > > return it to the client, which will calculate the distinct count.
> > > > > > > >
> > > > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the region
> > > > > > > > server moves the HFiles from HDFS to the region directory, does the
> > > > > > > > region server localize the HFile by downloading it locally and then
> > > > > > > > uploading it again to the region directory? Or does it just move
> > > > > > > > the file to the region directory and wait for the next compaction
> > > > > > > > to localize it, as in the region server failure case?
> > > > > > > >
> > > > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > For both scenarios you mentioned, the field is not the leading part
> > > > > > > > > of the row key. You would need to specify a time range or start
> > > > > > > > > row / stop row to narrow the key range being scanned.
> > > > > > > > >
> > > > > > > > > I am leaning toward using the second approach.
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > ~8-10 fields (5 of 20 bytes each and 3 of 200 bytes each).
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > How many fields such as F1 are you considering for embedding in
> > > > > > > > > > > the row key?
> > > > > > > > > > >
> > > > > > > > > > > Suggested reading:
> > > > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
> > > > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see
> > > > > > > > > > > ColumnPrefixFilter)
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > 1. So the size limit applies to each cell's identifier + value?
> > > > > > > > > > > >
> > > > > > > > > > > > Which is more optimal: having a field in the key or in a column
> > > > > > > > > > > > of a column family, if every row has that field?
> > > > > > > > > > > >
> > > > > > > > > > > > Say I have a field F1 in all rows:
> > > > > > > > > > > >
> > > > > > > > > > > > Situation 1:
> > > > > > > > > > > > key1#F1 (as a composite key), with the rest of the fields in
> > > > > > > > > > > > columns.
> > > > > > > > > > > >
> > > > > > > > > > > > Situation 2:
> > > > > > > > > > > > key1 as the key and F1 as part of a column family.
> > > > > > > > > > > >
> > > > > > > > > > > > This is the main reason I asked about the key size limit.
> > > > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it be
> > > > > > > > > > > > faster in situation 1 than in situation 2, since in situation 1
> > > > > > > > > > > > it can return the result just by traversing keys, with no need
> > > > > > > > > > > > to read columns?
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For #2, please see the following:
> > > > > > > > > > > > >
> > > > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
> > > > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of the row or of
> > > > > > > > > > > > > > the key only? Is there any limit on key size alone?
> > > > > > > > > > > > > > 2. The access pattern is mostly key based. Are memstores and
> > > > > > > > > > > > > > regions on a region server per table? That is, if I have
> > > > > > > > > > > > > > multiple tables, will there be multiple memstores instead of the
> > > > > > > > > > > > > > few there would have been with one large table?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     <name>hbase.client.keyvalue.maxsize</name>
> > > > > > > > > > > > > > >     <value>10485760</value>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For #2, it would be easier to answer if you can outline access
> > > > > > > > > > > > > > > patterns in your app.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For #3, adjustment according to current region boundaries is done
> > > > > > > > > > > > > > > client side. Take a look at the javadoc for LoadQueueItem
> > > > > > > > > > > > > > > in LoadIncrementalHFiles.java
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Is there any max limit on the key size of an HBase table?
> > > > > > > > > > > > > > > > 2. Multiple small tables vs. one large table: which one is
> > > > > > > > > > > > > > > > preferred?
> > > > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run, it
> > > > > > > > > > > > > > > > recalculates the region splits based on region boundaries. Does
> > > > > > > > > > > > > > > > this division happen on the client side, or on the server side
> > > > > > > > > > > > > > > > (at the region server or the HBase master), which then assigns
> > > > > > > > > > > > > > > > the splits that cross a target region boundary to the desired
> > > > > > > > > > > > > > > > region server?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> > > >
> > >
> >
>
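
A note on the distinct-count suggestion quoted above (aggregate per region on the server, finish on the client). For a distinct count, a bare per-region count cannot simply be summed, because the same F1 value may occur in several regions. Each region would need to return its distinct values (or some mergeable summary) for the client to combine. The following is a plain-Python stand-in for that idea, not the HBase coprocessor API; the function names and sample rows are hypothetical.

```python
def region_distinct(rows, field):
    """Stand-in for the server-side step: distinct values of `field` in one region."""
    return {row[field] for row in rows if field in row}

def client_distinct_count(regions, field):
    """Stand-in for the client-side step: union the per-region sets, then count."""
    merged = set()
    for rows in regions:
        merged |= region_distinct(rows, field)
    return len(merged)

# Two hypothetical regions; the value "a" appears in both.
regions = [
    [{"F1": "a"}, {"F1": "b"}],
    [{"F1": "a"}, {"F1": "c"}],
]

# Summing per-region distinct counts double-counts "a".
naive = sum(len(region_distinct(rows, "F1")) for rows in regions)
exact = client_distinct_count(regions, "F1")
print(naive, exact)
```

The trade-off is that each region ships a set rather than a single number, so the saving over a plain client-side scan depends on how many distinct values there are.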
