hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Parallel Scanner
Date Mon, 20 Feb 2017 16:20:59 GMT
Please read https://phoenix.apache.org/update_statistics.html

FYI

On Mon, Feb 20, 2017 at 8:14 AM, Anil <anilklce@gmail.com> wrote:

> Hi Ted,
>
> It's very difficult to predict the data distribution. We store parent-to-child
> relationships in the table. (Note: a parent record is also a child of
> itself.)
>
> We set the max HRegion file size to 10 GB. I don't think we have any control
> over region size :(
>
> Thanks
>
>
> On 20 February 2017 at 21:24, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Among the 5 columns, do you know roughly the data distribution?
> >
> > You should put the columns whose data distribution is relatively even
> > first. Of course, there may be business requirements which you need to take
> > into consideration w.r.t. the composite key.
> >
> > If you cannot change the schema, do you have control over the region size?
> > Smaller regions may lower the variance in data distribution per region.
> >
> > On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilklce@gmail.com> wrote:
> >
> > > Hi Ted,
> > >
> > > Current region size is 10 GB.
> > >
> > > The HBase row key is designed like a Phoenix primary key. I can say it is
> > > like a 5-column composite key. A common set of data would share the same
> > > first prefix. I am not sure how to convey the data distribution.
> > >
> > > Thanks.
> > >
> > > On 20 February 2017 at 20:48, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > Anil:
> > > > What's the current region size you use ?
> > > >
> > > > Given a region, do you have some idea how the data is distributed within
> > > > the region ?
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilklce@gmail.com> wrote:
> > > >
> > > > > i understand my original post now :)  Sorry about that.
> > > > >
> > > > > Now the challenge is to split a start key and end key on the client side
> > > > > to allow parallel scans on a table with no buckets or pre-salting.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > > > ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > >
> > > > > > > You are trying to scan one region itself in parallel, then even I got
> > > > > > > you wrong. Richard's suggestion is the right choice for client only soln.
> > > > > >
> > > > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilklce@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Richard :)
> > > > > > >
> > > > > > > > On 20 February 2017 at 18:56, Richard Startin <richardstartin@outlook.com> wrote:
> > > > > > >
> > > > > > > > > RegionLocator is not deprecated, hence the suggestion to use it if it's
> > > > > > > > > available in place of whatever is still available on HTable for your
> > > > > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > > > > HTable::getRegionsInRange no longer exists on the current master branch.
> > > > > > > >
> > > > > > > >
> > > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > > >
> > > > > > > >
> > > > > > > > > I thought you were asking about scanning many regions at the same time,
> > > > > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > > > > > parallelising scans over regions, not within regions.
> > > > > > > >
> > > > > > > >
> > > > > > > > > If you want to parallelise within a region, you could write a little
> > > > > > > > > method to split the first and last key of the region into several
> > > > > > > > > disjoint lexicographic buckets and create a scan for each bucket, then
> > > > > > > > > execute those scans in parallel. Your data probably doesn't distribute
> > > > > > > > > uniformly over lexicographic buckets though so the scans are unlikely to
> > > > > > > > > execute at a constant rate and you'll get results in time proportional
> > > > > > > > > to the lexicographic bucket with the highest cardinality in the region.
> > > > > > > > > I'd be interested to know if anyone on the list has ever tried this and
> > > > > > > > > what the results were?
> > > > > > > >
> > > > > > > >
> > > > > > > > > Using the much simpler approach of parallelising over regions by
> > > > > > > > > creating multiple disjoint scans client side, as suggested, your
> > > > > > > > > performance now depends on your regions which you have some control
> > > > > > > > > over. You can achieve the same effect by pre-splitting your table such
> > > > > > > > > that you empirically optimise read performance for the dataset you store.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Richard
> > > > > > > >
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Anil <anilklce@gmail.com>
> > > > > > > > Sent: 20 February 2017 12:35
> > > > > > > > To: user@hbase.apache.org
> > > > > > > > Subject: Re: Parallel Scanner
> > > > > > > >
> > > > > > > > Thanks Richard.
> > > > > > > >
> > > > > > > > > I am able to get the regions for data to be loaded from table. I am
> > > > > > > > > trying to scan a region in parallel :)
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > > On 20 February 2017 at 16:44, Richard Startin <richardstartin@outlook.com> wrote:
> > > > > > > >
> > > > > > > > > > For a client only solution, have you looked at the RegionLocator
> > > > > > > > > > interface? It gives you a list of pairs of byte[] (the start and stop
> > > > > > > > > > keys for each region). You can easily use a ForkJoinPool recursive
> > > > > > > > > > task or Java 8 parallel stream over that list. I implemented a Spark
> > > > > > > > > > RDD to do that and wrote about it with code samples here:
> > > > > > > > > >
> > > > > > > > > > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> > > > > > > > > >
> > > > > > > > > > Forget about the Spark details in the post (and forget that
> > > > > > > > > > Hortonworks have a library to do the same thing :)) the idea of
> > > > > > > > > > creating one scan per region and setting scan starts and stops from
> > > > > > > > > > the region locator would give you a parallel scan. Note you can also
> > > > > > > > > > group the scans by region server.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Richard
> > > > > > > > > > On 20 Feb 2017, at 07:33, Anil <anilklce@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Thanks Ram. I will look into EndPoints.
> > > > > > > > >
> > > > > > > > > > On 20 February 2017 at 12:29, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Yes. There is a way.
> > > > > > > > > >
> > > > > > > > > > Have you seen Endpoints? Endpoints are trigger-like points that allow
> > > > > > > > > > your client to invoke them in parallel in one or more regions using
> > > > > > > > > > the start and end key of the region. They execute in parallel, and
> > > > > > > > > > then you may have to sort out the results as per your need.
> > > > > > > > > >
> > > > > > > > > > But these endpoints have to be running on your region servers, so it
> > > > > > > > > > is not a client-only solution.
> > > > > > > > > > https://blogs.apache.org/hbase/entry/coprocessor_introduction
> > > > > > > > >
> > > > > > > > > > Be careful when you use them. Since these endpoints run on the server,
> > > > > > > > > > ensure that they are not heavy and do not consume a lot of memory,
> > > > > > > > > > which can have adverse effects on the server.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Ram
> > > > > > > > >
> > > > > > > > > > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilklce@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Thanks Ram.
> > > > > > > > >
> > > > > > > > > > So, you mean that there is no harm in using HTable#getRegionsInRange
> > > > > > > > > > in the application code.
> > > > > > > > > >
> > > > > > > > > > HTable#getRegionsInRange returned single entry for all my region start
> > > > > > > > > > key and end key. i need to explore more on this.
> > > > > > > > > >
> > > > > > > > > > "If you know the table region's start and end keys you could create
> > > > > > > > > > parallel scans in your application code." - is there any way to scan a
> > > > > > > > > > region in the application code other than the one i put in the
> > > > > > > > > > original email ?
> > > > > > > > > >
> > > > > > > > > > "One thing to watch out is that if there is a split in the region then
> > > > > > > > > > this start and end row may change so in that case it is better you try
> > > > > > > > > > to get the regions every time before you issue a scan"
> > > > > > > > > > - Agree. i am dynamically determining the region start key and end key
> > > > > > > > > > before initiating scan operations for every initial load.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On 20 February 2017 at 10:59, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Anil,
> > > > > > > > >
> > > > > > > > > > HBase directly does not provide parallel scans. If you know the table
> > > > > > > > > > region's start and end keys you could create parallel scans in your
> > > > > > > > > > application code.
> > > > > > > > > >
> > > > > > > > > > In the above code snippet, the intent is right - you get the required
> > > > > > > > > > regions and can issue parallel scans from your app.
> > > > > > > > > >
> > > > > > > > > > One thing to watch out is that if there is a split in the region then
> > > > > > > > > > this start and end row may change so in that case it is better you try
> > > > > > > > > > to get the regions every time before you issue a scan. Does that make
> > > > > > > > > > sense to you?
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Ram
> > > > > > > > >
> > > > > > > > > > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilklce@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi ,
> > > > > > > > >
> > > > > > > > > > I am building a use case where I have to load the HBase data into an
> > > > > > > > > > in-memory database (IMDB). I am scanning each region and loading the
> > > > > > > > > > data into the IMDB.
> > > > > > > > >
> > > > > > > > > > i am looking at parallel scanner
> > > > > > > > > > (https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935) to
> > > > > > > > > > reduce the load time and
> > > > > > > > > > HTable#getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload)
> > > > > > > > > > is deprecated, HBASE-1935 is still open.
> > > > > > > > >
> > > > > > > > > > I see Connection from ConnectionFactory is HConnectionImplementation
> > > > > > > > > > by default and creates HTable instance.
> > > > > > > > >
> > > > > > > > > > Do you see any issues in using HTable from Table instance ?
> > > > > > > > > >     for each region {
> > > > > > > > > >         int i = 0;
> > > > > > > > > >         List<HRegionLocation> regions = hTable.getRegionsInRange(
> > > > > > > > > >                 scans.getStartRow(), scans.getStopRow(), true);
> > > > > > > > > >
> > > > > > > > > >         for (HRegionLocation region : regions) {
> > > > > > > > > >             startRow = i == 0 ? scans.getStartRow()
> > > > > > > > > >                     : region.getRegionInfo().getStartKey();
> > > > > > > > > >             i++;
> > > > > > > > > >             endRow = i == regions.size() ? scans.getStopRow()
> > > > > > > > > >                     : region.getRegionInfo().getEndKey();
> > > > > > > > > >         }
> > > > > > > > > >     }
> > > > > > > > >
> > > > > > > > > > are there any alternatives to achieve parallel scan? Thanks.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
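
The within-region split discussed in the thread (divide a start key and a stop key into disjoint lexicographic buckets, then run one scan per bucket in parallel) can be sketched roughly as below. This is only an illustration under stated assumptions, not code from any poster: it uses the JDK alone (Java 9+ for Arrays.compareUnsigned), the names KeyRangeSplitter, split and scanInParallel are hypothetical, and the scanRange callback stands in for whatever actually issues an HBase scan between two row keys. It assumes a non-empty stop key, so the empty key HBase uses for the end of the last region needs separate handling. Keys are treated as right-padded fixed-width numbers, which gives evenly spaced boundaries but not evenly sized buckets when the data is skewed.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Hypothetical helper, not part of HBase.
final class KeyRangeSplitter {

    /** Writes a non-negative value as exactly `width` big-endian bytes. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();          // may carry a leading sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    /**
     * Returns boundary keys from start to stop (both included). Adjacent
     * pairs form disjoint [from, to) ranges; there are at most `buckets` + 1
     * boundaries, fewer if the range is too narrow to cut that finely.
     */
    static List<byte[]> split(byte[] start, byte[] stop, int buckets) {
        if (buckets < 1) {
            throw new IllegalArgumentException("buckets must be >= 1");
        }
        int width = Math.max(start.length, stop.length);
        // Right-pad with 0x00 so both keys are numbers of the same width.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stop, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before stop");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> boundaries = new ArrayList<>();
        boundaries.add(start);
        for (int i = 1; i < buckets; i++) {
            BigInteger cut = lo.add(
                span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(buckets)));
            byte[] key = toFixedWidth(cut, width);
            byte[] previous = boundaries.get(boundaries.size() - 1);
            // Keep only strictly increasing cut points (unsigned byte order).
            if (Arrays.compareUnsigned(key, previous) > 0) {
                boundaries.add(key);
            }
        }
        boundaries.add(stop);
        return boundaries;
    }

    /** Runs scanRange once per adjacent boundary pair on a fixed thread pool. */
    static <T> List<T> scanInParallel(List<byte[]> boundaries,
                                      BiFunction<byte[], byte[], T> scanRange,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                byte[] from = boundaries.get(i);
                byte[] to = boundaries.get(i + 1);
                Callable<T> task = () -> scanRange.apply(from, to);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());         // results keep bucket order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a bucket scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In use, split(startKey, stopKey, 4) would give up to five boundary keys, and scanInParallel(boundaries, (from, to) -> ..., 4) would run one task per adjacent pair. As noted in the thread, total time is bounded by the fullest bucket, and region boundaries can change after a split, so the start and stop keys are best fetched again before each load.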
