hbase-user mailing list archives

From ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com>
Subject Re: Parallel Scanner
Date Mon, 20 Feb 2017 14:51:06 GMT
If you are trying to scan a single region in parallel, then I misunderstood
you too. Richard's suggestion is the right choice for a client-only solution.

On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilklce@gmail.com> wrote:

> Thanks Richard :)
>
> On 20 February 2017 at 18:56, Richard Startin <richardstartin@outlook.com>
> wrote:
>
> > RegionLocator is not deprecated, hence the suggestion to use it if it's
> > available in place of whatever is still available on HTable for your
> > version of HBase - it will make upgrades easier. For instance
> > HTable::getRegionsInRange no longer exists on the current master branch.
> >
> >
> > "I am trying to scan a region in parallel :)"
> >
> >
> > I thought you were asking about scanning many regions at the same time,
> > not scanning a single region in parallel? HBASE-1935 is about
> > parallelising scans over regions, not within regions.
> >
> >
> > If you want to parallelise within a region, you could write a little
> > method to split the range between the first and last key of the region
> > into several disjoint lexicographic buckets and create a scan for each
> > bucket, then execute those scans in parallel. Your data probably doesn't
> > distribute uniformly over lexicographic buckets though, so the scans are
> > unlikely to execute at a constant rate and you'll get results in time
> > proportional to the lexicographic bucket with the highest cardinality in
> > the region. I'd be interested to know if anyone on the list has ever tried
> > this and what the results were?
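
The bucket-splitting idea above could be sketched roughly as follows, in plain Java with no HBase dependency. KeyRangeSplitter and its method names are made up for illustration. The sketch assumes both boundary keys are non-empty (the first and last regions of a table have empty boundary keys, which it does not handle), and it spaces the boundaries evenly over the key space, not over the data, so the skew caveat above still applies. I believe HBase's Bytes utility class has a split helper along similar lines, which may be the better choice if your version has it.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: splits [start, stop) into evenly
// spaced lexicographic buckets by treating the keys as unsigned integers.
final class KeyRangeSplitter {

    // Returns buckets + 1 boundary keys; bucket i is [result[i], result[i + 1]).
    // Assumes both keys are non-empty and start is below stop after padding.
    static List<byte[]> boundaries(byte[] start, byte[] stop, int buckets) {
        int len = Math.max(start.length, stop.length);
        // Right-pad the shorter key with 0x00 so both have the same width;
        // for equal-width keys, unsigned numeric order equals lexicographic order.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stop, len));
        if (buckets < 1 || lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("need buckets >= 1 and start < stop");
        }
        BigInteger width = hi.subtract(lo);
        List<byte[]> result = new ArrayList<>();
        result.add(start);
        for (int i = 1; i < buckets; i++) {
            BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(buckets));
            result.add(toFixedWidth(lo.add(offset), len));
        }
        result.add(stop);
        return result;
    }

    // Converts a non-negative value below 256^len to exactly len big-endian bytes.
    private static byte[] toFixedWidth(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading 0x00 sign byte
        byte[] out = new byte[len];
        int n = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - n, out, len - n, n);
        return out;
    }
}
```

Each adjacent pair of boundaries would become the start and stop row of one scan. When the key range is narrower than the number of buckets, neighbouring boundaries can coincide, giving empty scans that are harmless but wasted.
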
> >
> >
> > Using the much simpler approach of parallelising over regions by creating
> > multiple disjoint scans client side, as suggested, your performance now
> > depends on your regions which you have some control over. You can achieve
> > the same effect by pre-splitting your table such that you empirically
> > optimise read performance for the dataset you store.
> >
> >
> > Thanks,
> >
> > Richard
> >
> >
> > ________________________________
> > From: Anil <anilklce@gmail.com>
> > Sent: 20 February 2017 12:35
> > To: user@hbase.apache.org
> > Subject: Re: Parallel Scanner
> >
> > Thanks Richard.
> >
> > I am able to get the regions for the data to be loaded from the table. I am
> > trying to scan a region in parallel :)
> >
> > Thanks
> >
> > On 20 February 2017 at 16:44, Richard Startin <
> richardstartin@outlook.com>
> > wrote:
> >
> > > For a client only solution, have you looked at the RegionLocator
> > > interface? It gives you a list of pairs of byte[] (the start and stop
> > > keys for each region). You can easily use a ForkJoinPool recursive task
> > > or a Java 8 parallel stream over that list. I implemented a Spark RDD to
> > > do that and wrote about it with code samples here:
> > >
> > > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> > >
> > > Forget about the Spark details in the post (and forget that Hortonworks
> > > have a library to do the same thing :)). The idea of creating one scan
> > > per region and setting scan starts and stops from the region locator
> > > would give you a parallel scan. Note you can also group the scans by
> > > region server.
> > >
> > > Cheers,
> > > Richard
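
A rough sketch of the one-scan-per-region idea above, using only the JDK so that it runs on its own. PerRegionScans, KeyRange, clip and scanAll are hypothetical names, not HBase API. In a real client the region boundaries would come from the RegionLocator, and the scanOne function would open a Scan with the given start and stop rows and read it; both are left out here as assumptions. Note that a parallel stream runs on the common ForkJoinPool, which may not suit blocking I/O, so a dedicated executor could be a better fit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch, not HBase API: one sub-range per region, run in parallel.
final class PerRegionScans {

    // Half-open key range [start, stop). As in HBase, an empty start means
    // "from the first row" and an empty stop means "to the last row".
    static final class KeyRange {
        final byte[] start;
        final byte[] stop;

        KeyRange(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Intersects each region with the requested scan range, dropping regions
    // that do not overlap it.
    static List<KeyRange> clip(List<KeyRange> regions, KeyRange scan) {
        List<KeyRange> out = new ArrayList<>();
        for (KeyRange r : regions) {
            // An empty start already sorts lowest, so a plain max works.
            byte[] start = compare(r.start, scan.start) >= 0 ? r.start : scan.start;
            byte[] stop;
            if (r.stop.length == 0) {
                stop = scan.stop;
            } else if (scan.stop.length == 0) {
                stop = r.stop;
            } else {
                stop = compare(r.stop, scan.stop) <= 0 ? r.stop : scan.stop;
            }
            if (stop.length == 0 || compare(start, stop) < 0) {
                out.add(new KeyRange(start, stop));
            }
        }
        return out;
    }

    // Runs scanOne on every sub-range with a parallel stream; results come
    // back in region order. scanOne stands in for opening and reading a scanner.
    static <T> List<T> scanAll(List<KeyRange> ranges, Function<KeyRange, T> scanOne) {
        return ranges.parallelStream().map(scanOne).collect(Collectors.toList());
    }
}
```

Because region boundaries can move after a split, the list passed to clip should be fetched again before each load instead of being cached.
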
> > > On 20 Feb 2017, at 07:33, Anil <anilklce@gmail.com> wrote:
> > >
> > > Thanks Ram. I will look into EndPoints.
> > >
> > > On 20 February 2017 at 12:29, ramkrishna vasudevan
> > > <ramkrishna.s.vasudevan@gmail.com> wrote:
> > >
> > > Yes, there is a way.
> > >
> > > Have you seen Endpoints? Endpoints are trigger-like hooks that your
> > > client can invoke in parallel on one or more regions, selected by the
> > > start and end key of the region. They execute in parallel, and you may
> > > then have to sort out the results as per your need.
> > >
> > > But these endpoints have to be running on your region servers, so it is
> > > not a client-only solution.
> > > https://blogs.apache.org/hbase/entry/coprocessor_introduction.
> > >
> > > Be careful when you use them. Since these endpoints run on the server,
> > > ensure that they are not heavy and do not consume a lot of memory, which
> > > can have adverse effects on the server.
> > >
> > >
> > > Regards
> > > Ram
> > >
> > > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilklce@gmail.com> wrote:
> > >
> > > Thanks Ram.
> > >
> > > So, you mean that there is no harm in using HTable#getRegionsInRange in
> > > the application code.
> > >
> > > HTable#getRegionsInRange returned a single entry for all my region start
> > > and end keys. I need to explore more on this.
> > >
> > > "If you know the table region's start and end keys you could create
> > > parallel scans in your application code." - is there any way to scan a
> > > region in the application code other than the one I put in the original
> > > email?
> > >
> > > "One thing to watch out is that if there is a split in the region then
> > > this start and end row may change so in that case it is better you try to
> > > get the regions every time before you issue a scan"
> > > - Agree. I am dynamically determining the region start key and end key
> > > before initiating scan operations for every initial load.
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > > On 20 February 2017 at 10:59, ramkrishna vasudevan
> > > <ramkrishna.s.vasudevan@gmail.com> wrote:
> > >
> > > Hi Anil,
> > >
> > > HBase does not directly provide parallel scans. If you know the table
> > > region's start and end keys you could create parallel scans in your
> > > application code.
> > >
> > > In the above code snippet, the intent is right - you get the required
> > > regions and can issue parallel scans from your app.
> > >
> > > One thing to watch out is that if there is a split in the region then
> > > this start and end row may change so in that case it is better you try to
> > > get the regions every time before you issue a scan. Does that make sense
> > > to you?
> > >
> > > Regards
> > > Ram
> > >
> > > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilklce@gmail.com> wrote:
> > >
> > > Hi ,
> > >
> > > I am building a use case where I have to load HBase data into an
> > > in-memory database (IMDB). I am scanning each region and loading the
> > > data into the IMDB.
> > >
> > > I am looking at the parallel scanner (
> > > https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce
> > > the load time, but HTable#getRegionsInRange(byte[] startKey, byte[]
> > > endKey, boolean reload) is deprecated and HBASE-1935 is still open.
> > >
> > > I see that the Connection from ConnectionFactory is an
> > > HConnectionImplementation by default and creates an HTable instance.
> > >
> > > Do you see any issues in using HTable from the Table instance?
> > >     for each region {
> > >         // Look up the regions covered by this scan's row range.
> > >         List<HRegionLocation> regions = hTable.getRegionsInRange(
> > >                 scans.getStartRow(), scans.getStopRow(), true);
> > >         int i = 0;
> > >         for (HRegionLocation region : regions) {
> > >             // The first sub-scan starts at the scan's own start row.
> > >             startRow = i == 0 ? scans.getStartRow()
> > >                     : region.getRegionInfo().getStartKey();
> > >             i++;
> > >             // The last sub-scan stops at the scan's own stop row.
> > >             endRow = i == regions.size() ? scans.getStopRow()
> > >                     : region.getRegionInfo().getEndKey();
> > >         }
> > >     }
> > >
> > > are there any alternatives to achieve parallel scan? Thanks.
> > >
> > > Thanks
> > >
> > >
> > >
> > >
> > >
> >
>
