hbase-user mailing list archives

From Shushant Arora <shushantaror...@gmail.com>
Subject Re: hbase doubts
Date Wed, 19 Aug 2015 04:53:03 GMT
When the last region gets new data and is split in two, what is the split point?
Say the last region had 10 files and the split algorithm decided to split
this region.

Will the two child regions have 5 files each? Or will the key space of the original
region (parent region), say with range (2015-08-01#guid to 2015-08-06#guid),
be divided into 2 equal parts, so that child1 has (2015-08-01#guid to
2015-08-03#guid) and child2 has (2015-08-04#guid to 2015-08-06#guid),
with all data rewritten into the child regions to accommodate these key ranges?
In that case, since the data is time series based, new data will come with increasing dates,
and only for dates > 2015-08-06, so it will go to child2, and child1 will always be
half filled. Only child2 will lead to new splits when it reaches the split
size threshold.
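
To make the concern above concrete, here is a toy model in Python of how a row key maps to a region once the parent's range is cut at a split point. It is only a sketch: `region_for`, the split key `2015-08-04`, and the sample row keys are made up for illustration, and it does not model how HBase actually chooses the split point or moves files.

```python
from bisect import bisect_right


def region_for(row_key, split_points):
    """Return the index of the region whose key range contains row_key.

    Region i covers [split_points[i-1], split_points[i]); the start key is
    inclusive, so a key equal to a split point goes to the right-hand region.
    """
    return bisect_right(split_points, row_key)


# Hypothetical split of the parent range into child1 (index 0) and child2 (index 1).
split_points = ["2015-08-04"]

# Rows that existed before the split (illustrative keys only).
old_keys = ["2015-08-01#a", "2015-08-03#b", "2015-08-04#c", "2015-08-06#d"]
# Rows arriving after the split: dates only increase.
new_keys = ["2015-08-07#e", "2015-08-08#f", "2015-09-01#g"]

old_regions = [region_for(k, split_points) for k in old_keys]
new_regions = [region_for(k, split_points) for k in new_keys]

print("old rows ->", old_regions)
print("new rows ->", new_regions)
```

In this model every row with a later date lands in the last region, which is the situation the question describes: child1 stops receiving writes and only child2 keeps growing.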






On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Since year and month are part of the row key in this scenario (instead of
> just the day of month), the last region would get new data and be split.
>
> Is this effect desirable for your app ?
>
> Cheers
>
> On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora <shushantarora09@gmail.com> wrote:
>
> > For an HBase key with time as the prefix, say (yyyy-mm-dd#other fields of
> > guid base), I am using bulk load to avoid hot spotting a regionserver
> > (avoiding writes to the WAL).
> >
> > What should the initial region splits be? Say I have 30 regionservers.
> >
> > Will it work to use the first 30 days as the initial splits and then let auto
> > split take care of splitting regions as they grow further?
> > Or, since the key has the date as its prefix, when a region is split in two at
> > the midpoint and new data arrives only for increasing dates, will one region
> > always stay half filled, with its other half never filled?
> >
> > On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <anilgupta84@gmail.com> wrote:
> >
> > > In my experience, Phoenix is far superior to the Hive-HBase integration
> > > for SQL-like querying on HBase, because Phoenix is built on top of
> > > HBase, unlike Hive.
> > >
> > > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > To my knowledge, Phoenix provides better integration with hbase.
> > > >
> > > > A third possibility is Spark on HBase.
> > > >
> > > > If you want to explore these alternatives, I suggest asking on the respective
> > > > mailing lists where you can get expert opinions.
> > > >
> > > > Cheers
> > > >
> > > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > >
> > > > > Thanks!
> > > > >
> > > > > Which one is better for SQL-like queries over HBase (queries involving
> > > > > filters, key range scans, and aggregates by column values)?
> > > > >
> > > > > 1. Hive storage handlers
> > > > > 2. Phoenix
> > > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > >
> > > > > > For #1, if you want to count distinct values for F1, you can write a
> > > > > > coprocessor which aggregates the count on the region server and returns
> > > > > > the result to the client, which does the final aggregation.
> > > > > >
> > > > > > Take a look at
> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
> > > > > > and related classes for an example.
> > > > > >
> > > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks!
> > > > > > > A few more doubts:
> > > > > > >
> > > > > > > 1. Say the requirement is to count distinct values of F1.
> > > > > > >
> > > > > > > If the field is part of the key, can't HBase just scan the keys, skip
> > > > > > > value deserialisation, and return the result to the client, which will
> > > > > > > calculate the distinct count? In the second approach, HBase will
> > > > > > > deserialise the value of the column containing F1 and return it to the
> > > > > > > client, which will calculate the distinct count.
> > > > > > >
> > > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the regionserver
> > > > > > > moves the HFiles from HDFS to the region directory, does the regionserver
> > > > > > > localise the HFile by downloading it to local disk and then uploading it
> > > > > > > again into the region directory? Or does it just move the file to the
> > > > > > > region directory and wait for the next compaction to localise it, as in
> > > > > > > the regionserver failure case?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > >
> > > > > > > > For both scenarios you mentioned, the field is not the leading part of
> > > > > > > > the row key. You would need to specify a time range or start row /
> > > > > > > > stop row to narrow the key range being scanned.
> > > > > > > >
> > > > > > > > I am leaning toward using the second approach.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > ~8-10 fields (5 of 20 bytes each) and 3 fields of 200 bytes
> > > > > > > > > each.
> > > > > > > > >
> > > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > How many fields such as F1 are you considering for embedding in
> > > > > > > > > > the row key?
> > > > > > > > > >
> > > > > > > > > > Suggested reading:
> > > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
> > > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see
> > > > > > > > > > ColumnPrefixFilter)
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > 1. So the size limit is per cell's identifier + value?
> > > > > > > > > > >
> > > > > > > > > > > Which is more optimal: to have a field in the key or in a column
> > > > > > > > > > > of a column family, if the pattern is that every row has that field?
> > > > > > > > > > >
> > > > > > > > > > > Say I have a field F1 in all rows, so:
> > > > > > > > > > > Situation 1:
> > > > > > > > > > > key1#F1 (as composite key), with the rest of the fields in columns
> > > > > > > > > > >
> > > > > > > > > > > Situation 2:
> > > > > > > > > > > key1 as key and F1 part of the column family.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This is the main reason I asked about the key size limit.
> > > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it be
> > > > > > > > > > > faster in situation 1 than in situation 2, since in 1 it can return
> > > > > > > > > > > the result just by traversing keys, with no need to read columns?
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
> > > > > > > > > > > >
> > > > > > > > > > > > For #2, please see the following:
> > > > > > > > > > > >
> > > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
> > > > > > > > > > > >
> > > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of a row or of
> > > > > > > > > > > > > the key only? Is there any limit on key size alone?
> > > > > > > > > > > > > 2. The access pattern is mostly key based. Are memstores and
> > > > > > > > > > > > > regions on a regionserver per table? That is, if I have multiple
> > > > > > > > > > > > > tables, will it have multiple memstores instead of the few it
> > > > > > > > > > > > > would have had with one large table?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     <name>hbase.client.keyvalue.maxsize</name>
> > > > > > > > > > > > > >     <value>10485760</value>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For #2, it would be easier to answer if you can outline the
> > > > > > > > > > > > > > access patterns in your app.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For #3, adjustment according to current region boundaries is
> > > > > > > > > > > > > > done client side. Take a look at the javadoc for LoadQueueItem
> > > > > > > > > > > > > > in LoadIncrementalHFiles.java
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora <shushantarora09@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Is there any max limit on the key size of an HBase table?
> > > > > > > > > > > > > > > 2. Multiple small tables vs. one large table: which one is
> > > > > > > > > > > > > > > preferred?
> > > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run, it again
> > > > > > > > > > > > > > > recalculates the region splits based on region boundaries. Does
> > > > > > > > > > > > > > > this division happen on the client side, or on the server side
> > > > > > > > > > > > > > > at the region server or HBase master, which then assigns the
> > > > > > > > > > > > > > > splits that cross a target region boundary to the desired
> > > > > > > > > > > > > > > regionserver?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> > >
> >
>
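
The coprocessor approach suggested in the quoted thread above (aggregate on each region server, then do the final aggregation on the client) can be sketched in plain Python without any HBase API. This is a minimal illustration of the two-phase idea only: the function names and the sample rows are hypothetical, and a real implementation would be a Java coprocessor along the lines of AggregateImplementation.

```python
def region_partial_distinct(rows, field):
    """Server-side step: collect the distinct values of `field` in one region."""
    return {row[field] for row in rows if field in row}


def client_merge(partials):
    """Client-side step: union the per-region sets and count the result."""
    merged = set()
    for partial in partials:
        merged |= partial
    return len(merged)


# Two hypothetical regions; the value "b" appears in both of them.
region_1 = [{"F1": "a"}, {"F1": "b"}, {"F1": "a"}]
region_2 = [{"F1": "b"}, {"F1": "c"}]

partials = [region_partial_distinct(r, "F1") for r in (region_1, region_2)]
distinct_count = client_merge(partials)
print(distinct_count)
```

Each region returns its set of distinct values rather than a number, because the same F1 value can occur in more than one region and per-region counts cannot simply be added up.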
