hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Using Scans in parallel
Date Sun, 09 Oct 2011 20:13:12 GMT
Which version of HBase?
Are there concurrent inserts? If so, do you see splits in the log files happening while you
do the scanning?

I am pretty sure this has nothing to do with concurrent scans.

From: Bryan Keller <bryanck@gmail.com>
To: Bryan Keller <bryanck@gmail.com>
Cc: user@hbase.apache.org
Sent: Sunday, October 9, 2011 11:03 AM
Subject: Re: Using Scans in parallel

On further thought, it seems this might be a serious issue, as two unrelated processes within
an application may be scanning the same table at the same time.

On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:

> I was not able to get consistent results using multiple scanners in parallel on a table.
> I implemented a counter test that used 8 scanners in parallel on a table with 2m rows with
> 2k+ columns each, and the results were not consistent. There were no errors thrown, but the
> count was off by as much as 2%. Using a single thread gave the same (correct) result every
> time.
>
> I tried various approaches, such as creating an HTable and opening a connection per thread,
> but I was not able to get stable results. I would do some testing before using parallel scanners
> as described here.
> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>> That's part of it, the other part is to get the region demarcations.
>> You can also just get the smallest and largest key of the table and pick other demarcations
>> for your scans. Then your individual scans will likely cover multiple regions and regionservers.
>> Your threading model depends on your needs. If you are interested in lowest latency, you
>> want to keep your regionservers busy for each query.
>> What exactly that means depends on your setup. Maybe you split up the overall scan
>> so that no more than N scans are active at any regionserver.
>> If you're more interested in overall predictability, you might not want to parallelize
>> each scan too much.
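
A minimal sketch of the "pick other demarcations" idea, assuming row keys are compared as unsigned byte strings. It takes the smallest and largest key and interpolates evenly spaced boundaries between them. The class and method names (`KeyRangeSplitter`, `split`) are hypothetical and not part of HBase. Because a scan's stop row is normally exclusive, the last range needs separate handling so that the largest key itself is still read.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an HBase class.
class KeyRangeSplitter {

    // Returns parts + 1 boundaries: start, the interpolated keys, then stop.
    static List<byte[]> split(byte[] start, byte[] stop, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        // Right-pad both keys with zero bytes to a common length so they can
        // be treated as unsigned integers of the same width.
        int len = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(stop, len));
        BigInteger span = hi.subtract(lo);

        List<byte[]> bounds = new ArrayList<>();
        bounds.add(start); // keep the original first key unchanged
        for (int i = 1; i < parts; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(parts));
            bounds.add(toFixedWidth(lo.add(offset), len));
        }
        bounds.add(stop); // keep the original last key unchanged
        return bounds;
    }

    // Converts a non-negative value to exactly len bytes, big-endian.
    private static byte[] toFixedWidth(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte
        byte[] out = new byte[len];
        int n = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - n, out, len - n, n);
        return out;
    }
}
```

Each adjacent pair of boundaries would then serve as the start and stop row of one scan. Even spacing in key space does not imply even spacing in row count, so skewed key distributions would need boundaries chosen differently.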
>> ----- Original Message -----
>> From: Sam Seigal <selekt86@yahoo.com>
>> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>> Cc: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>> Sent: Wednesday, October 5, 2011 6:18 PM
>> Subject: Re: Using Scans in parallel
>> So the whole point of getting the region locations is to ensure that
>> there is one thread per region server ?
>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>> Hi Sam,
>>> There were some attempts to build this in. In the end I think the exact patterns
>>> are different based on what one is trying to achieve.
>>> Currently what you can do is get all the region locations (HTable.getRegionLocations).
>>> From the HRegionInfos you can
>>> get the regions' start and end keys.
>>> Now you can issue parallel scans for as many regions as you want (by creating a
>>> Scan object with start and stop row set to the region's
>>> start and end key).
>>> You probably want to group the regions by regionserver and have one thread per
>>> region server, or something.
>>> -- Lars
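
The grouping suggested above can be sketched with the JDK alone. This is an illustration under stated assumptions, not HBase client code: a `NavigableMap` stands in for the table, the `Region` class is a hypothetical stand-in for what the region locations would provide (start key, end key, hosting server), and an empty key means "unbounded", mirroring how the first and last regions are delimited. One task runs per server, and each task scans that server's regions one after another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: an in-memory sorted map plays the role of the table.
class PerServerScan {

    // Stand-in for a region: rows in [startKey, endKey) hosted on one server.
    static final class Region {
        final String startKey;
        final String endKey;
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    // Counts the rows in one region; an empty endKey means "to the end".
    static long scanRegion(NavigableMap<String, String> table, Region r) {
        if (r.endKey.isEmpty()) {
            return table.tailMap(r.startKey, true).size();
        }
        return table.subMap(r.startKey, true, r.endKey, false).size();
    }

    // Groups regions by server and runs one task per server.
    static long countRows(NavigableMap<String, String> table, List<Region> regions) {
        Map<String, List<Region>> byServer = new HashMap<>();
        for (Region r : regions) {
            byServer.computeIfAbsent(r.server, s -> new ArrayList<>()).add(r);
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, byServer.size()));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<Region> group : byServer.values()) {
                Callable<Long> task = () -> {
                    long n = 0;
                    for (Region r : group) {
                        n += scanRegion(table, r);
                    }
                    return n;
                };
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scan task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The map here is only read, never modified, while the tasks run. Against a live table that is also receiving writes, region boundaries can change mid-scan, which this model does not capture.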
>>> ________________________________
>>> From: Sam Seigal <selekt86@yahoo.com>
>>> To: hbase-user@hadoop.apache.org
>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>> Subject: Using Scans in parallel
>>> Hi,
>>> Is there a known way to do Scans in parallel (in different
>>> threads even) and then sort/combine the output?
>>> For a row key like:
>>> prefix-event_type-event_id
>>> prefix-event_type-event_id
>>> I want to declare two scan objects (for say event_id_type foo)
>>> Scan 1 =>  0-foo
>>> Scan 2 =>  1-foo
>>> execute the scans in parallel (maybe even in different threads) and
>>> then merge the results?
>>> Thank you,
>>> Sam
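
The pattern asked about in the original question, one scan per key prefix followed by a merge, can be sketched as follows. This uses only the JDK: a `NavigableMap` stands in for the table, and all names (`ParallelPrefixScan`, `prefixScan`, `mergeSorted`, `scanAndMerge`) are hypothetical. The merge order is supplied by the caller, since the thread does not say how the combined output should be sorted. Each individual scan result must already be sorted under that order, and the prefix is assumed to be non-empty and not to end in the maximum character value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: an in-memory sorted map plays the role of the table.
class ParallelPrefixScan {

    // Returns all keys that start with prefix, in key order.
    static List<String> prefixScan(NavigableMap<String, String> table, String prefix) {
        // Exclusive stop key: the prefix with its last character incremented.
        int last = prefix.length() - 1;
        String stop = prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
        return new ArrayList<>(table.subMap(prefix, true, stop, false).keySet());
    }

    // K-way merge of runs that are each already sorted under order.
    static List<String> mergeSorted(List<List<String>> runs, Comparator<String> order) {
        // Heap entries are {runIndex, positionInRun}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> order.compare(runs.get(a[0]).get(a[1]), runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    // Runs one prefix scan per thread, then merges the results.
    static List<String> scanAndMerge(NavigableMap<String, String> table,
                                     List<String> prefixes,
                                     Comparator<String> order) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, prefixes.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String p : prefixes) {
                Callable<List<String>> task = () -> prefixScan(table, p);
                futures.add(pool.submit(task));
            }
            List<List<String>> runs = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                runs.add(f.get());
            }
            return mergeSorted(runs, order);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scan task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For keys shaped like `prefix-event_type-event_id`, one possible order is to compare everything after the first `-`, which interleaves rows from the `0-foo` and `1-foo` scans by event id. If plain row-key order is wanted instead, concatenating the scan results in prefix order is enough and no merge is required.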