hbase-user mailing list archives

From Richard Startin <richardstar...@outlook.com>
Subject Re: Parallel Scanner
Date Mon, 20 Feb 2017 13:26:43 GMT
RegionLocator is not deprecated, hence the suggestion to use it if it's available in place
of whatever is still available on HTable for your version of HBase - it will make upgrades
easier. For instance HTable::getRegionsInRange no longer exists on the current master branch.


"I am trying to scan a region in parallel :)"


I thought you were asking about scanning many regions at the same time, not scanning a single
region in parallel? HBASE-1935 is about parallelising scans over regions, not within regions.


If you want to parallelise within a region, you could write a little method to split the range
between the first and last key of the region into several disjoint lexicographic buckets and
create a scan for each bucket, then execute those scans in parallel. Your data probably doesn't
distribute uniformly over lexicographic buckets, though, so the scans are unlikely to execute at
a constant rate, and you'll get results in time proportional to the size of the bucket with the
highest cardinality in the region. I'd be interested to know if anyone on the list has ever tried
this and what the results were.
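
A minimal sketch of the bucket-splitting idea described above, in plain Java with no HBase dependency. The class and method names (KeyBuckets, split, toFixedWidth) are hypothetical and not part of any HBase API. It assumes both keys are non-empty (so it does not handle the empty start or end key of the first or last region of a table) and it interpolates boundaries linearly over the byte values, which says nothing about how rows are actually distributed.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not an HBase class: splits [start, stop) into n key buckets.
final class KeyBuckets {

    /** Returns n + 1 boundary keys; bucket i covers [bounds[i], bounds[i + 1]). */
    static byte[][] split(byte[] start, byte[] stop, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Right-pad both keys with zero bytes to a common width, plus one extra
        // byte of resolution so that up to 256 buckets get distinct boundaries.
        int width = Math.max(start.length, stop.length) + 1;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stop, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before stop");
        }
        BigInteger span = hi.subtract(lo);
        byte[][] bounds = new byte[n + 1][];
        bounds[0] = start;   // keep the original keys at both ends
        bounds[n] = stop;
        for (int i = 1; i < n; i++) {
            // Linear interpolation over the unsigned big-endian key value.
            BigInteger b = lo.add(span.multiply(BigInteger.valueOf(i))
                                      .divide(BigInteger.valueOf(n)));
            bounds[i] = toFixedWidth(b, width);
        }
        return bounds;
    }

    /** Encodes a non-negative value as exactly width big-endian bytes. */
    static byte[] toFixedWidth(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

Each adjacent pair of boundaries would then become the start row and stop row of one scan. Because the split is over key values rather than row counts, a skewed key distribution still leaves one bucket doing most of the work, which is the caveat raised above.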


Using the much simpler approach of parallelising over regions by creating multiple disjoint
scans client side, as suggested, your performance now depends on your regions, which you have
some control over. You can achieve the same effect by pre-splitting your table such that you
empirically optimise read performance for the dataset you store.
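
A rough sketch of the client-side pattern, one task per disjoint key range, written against plain Java so it stands alone. ParallelRangeScan, KeyRange and scanAll are hypothetical names. The scanOneRange function stands in for whatever performs the real scan; in an HBase client it would typically build a Scan bounded by the range's start and stop rows and drain the ResultScanner returned by Table.getScanner, with the ranges taken from RegionLocator.getStartEndKeys().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Hypothetical helper, not an HBase class: runs one scan task per key range.
final class ParallelRangeScan {

    /** One half-open key range [start, stop), e.g. a region's start and end key. */
    static final class KeyRange {
        final byte[] start;
        final byte[] stop;

        KeyRange(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Submits scanOneRange once per range and concatenates the results in the
     * order the ranges were given, so sorted ranges yield key-ordered output.
     */
    static <R> List<R> scanAll(List<KeyRange> ranges,
                               BiFunction<byte[], byte[], List<R>> scanOneRange,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (KeyRange r : ranges) {
                Callable<List<R>> task = () -> scanOneRange.apply(r.start, r.stop);
                futures.add(pool.submit(task));
            }
            List<R> all = new ArrayList<>();
            for (Future<List<R>> f : futures) {
                try {
                    all.addAll(f.get()); // blocks until this range has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while scanning", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("range scan failed", e.getCause());
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

Total time is bounded by the slowest range, so evenly sized regions (for example from pre-splitting) matter more than the thread count. Since regions can split between lookups, it is safer to fetch the region boundaries immediately before building the ranges.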


Thanks,

Richard


________________________________
From: Anil <anilklce@gmail.com>
Sent: 20 February 2017 12:35
To: user@hbase.apache.org
Subject: Re: Parallel Scanner

Thanks Richard.

I am able to get the regions for data to be loaded from table. I am trying
to scan a region in parallel :)

Thanks

On 20 February 2017 at 16:44, Richard Startin <richardstartin@outlook.com>
wrote:

> For a client only solution, have you looked at the RegionLocator
> interface? It gives you a list of pairs of byte[] (the start and stop keys
> for each region). You can easily use a ForkJoinPool recursive task or java
> 8 parallel stream over that list. I implemented a spark RDD to do that and
> wrote about it with code samples here:
>
> https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
>
> Forget about the spark details in the post (and forget that Hortonworks
> have a library to do the same thing :)) the idea of creating one scan per
> region and setting scan starts and stops from the region locator would give
> you a parallel scan. Note you can also group the scans by region server.
>
> Cheers,
> Richard
> On 20 Feb 2017, at 07:33, Anil <anilklce@gmail.com> wrote:
>
> Thanks Ram. I will look into EndPoints.
>
> On 20 February 2017 at 12:29, ramkrishna vasudevan
> <ramkrishna.s.vasudevan@gmail.com> wrote:
>
> Yes. There is a way.
>
> Have you seen Endpoints? Endpoints are trigger-like points that your
> client can invoke in parallel on one or more regions using the start and
> end key of the region. They execute in parallel, and then you may have to
> sort out the results as per your need.
>
> But these endpoints have to be running on your region servers, so it is
> not a client-only solution.
> https://blogs.apache.org/hbase/entry/coprocessor_introduction.
>
> Be careful when you use them. Since these endpoints run on the server,
> ensure that they are not heavy and do not consume a lot of memory, which
> can have adverse effects on the server.
>
>
> Regards
> Ram
>
> On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilklce@gmail.com> wrote:
>
> Thanks Ram.
>
> So, you mean that there is no harm in using HTable#getRegionsInRange in
> the application code?
>
> HTable#getRegionsInRange returned a single entry for all my region start
> keys and end keys. I need to explore more on this.
>
> "If you know the table region's start and end keys you could create
> parallel scans in your application code."  - is there any way to scan a
> region in the application code other than the one i put in the original
> email ?
>
> "One thing to watch out is that if there is a split in the region then
> this start and end row may change so in that case it is better you try to
> get the regions every time before you issue a scan"
> - Agree. I am dynamically determining the region start key and end key
> before initiating scan operations for every initial load.
>
> Thanks.
>
>
>
>
> On 20 February 2017 at 10:59, ramkrishna vasudevan
> <ramkrishna.s.vasudevan@gmail.com> wrote:
>
> Hi Anil,
>
> HBase directly does not provide parallel scans. If you know the table
> region's start and end keys you could create parallel scans in your
> application code.
>
> In the above code snippet, the intent is right - you get the required
> regions and can issue parallel scans from your app.
>
> One thing to watch out for is that if there is a split in the region then
> this start and end row may change, so in that case it is better to get
> the regions every time before you issue a scan. Does that make sense to
> you?
>
> Regards
> Ram
>
> On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilklce@gmail.com> wrote:
>
> Hi ,
>
> I am building a use case where I have to load HBase data into an
> in-memory database (IMDB). I am scanning each region and loading the
> data into the IMDB.
>
> I am looking at the parallel scanner issues (
> https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce
> the load time. HTable#getRegionsInRange(byte[] startKey, byte[] endKey,
> boolean reload) is deprecated, and HBASE-1935 is still open.
>
> I see that the Connection from ConnectionFactory is HConnectionImplementation
> by default and creates an HTable instance.
>
> Do you see any issues in using HTable from Table instance ?
>            for each region {
>                int i = 0;
>                List<HRegionLocation> regions =
>                    hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
>
>                for (HRegionLocation region : regions) {
>                    // first region starts at the scan's own start row
>                    startRow = i == 0 ? scans.getStartRow() : region.getRegionInfo().getStartKey();
>                    i++;
>                    // last region ends at the scan's own stop row
>                    endRow = i == regions.size() ? scans.getStopRow() : region.getRegionInfo().getEndKey();
>                }
>            }
>
> are there any alternatives to achieve parallel scan? Thanks.
>
> Thanks
>
>
>
>
>
