hbase-user mailing list archives

From Himanshu Vashishtha <hvash...@cs.ualberta.ca>
Subject Re: Using Scans in parallel
Date Mon, 10 Oct 2011 02:00:56 GMT

Interesting.

Hey Bryan, can you please share some stats: how many regions, how
many region servers, and the time taken by the serial scanner versus
the 8 parallel scanners?

Himanshu

On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <bryanck@gmail.com> wrote:
> This is 100% reproducible for me, so I doubt it is related to random number generation.
>
> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>
>> How frequently does this happen?
>> I did notice a while ago in the code that scanner ids are drawn just from a Random number generator.
>>
>> So in theory it would be possible that multiple concurrent scans draw the same scanner id.
>>
>> Since these are longs, this is astronomically unlikely, though (picking the same number out of 2^64 just does not happen :) ).
>>
>>
>>
>> ________________________________
>> From: Bryan Keller <bryanck@gmail.com>
>> To: user@hbase.apache.org
>> Sent: Sunday, October 9, 2011 2:40 PM
>> Subject: Re: Using Scans in parallel
>>
>> This is just scanning (reads). I'll need to do more testing to find a cause, hopefully it is something with my test.
>>
>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>
>>> Which version of HBase?
>>> Are there concurrent inserts? If so, do you see splits in the log files happening while you do the scanning?
>>>
>>> I am pretty sure this has nothing to do with concurrent scans.
>>>
>>> From: Bryan Keller <bryanck@gmail.com>
>>> To: Bryan Keller <bryanck@gmail.com>
>>> Cc: user@hbase.apache.org
>>> Sent: Sunday, October 9, 2011 11:03 AM
>>> Subject: Re: Using Scans in parallel
>>>
>>> On further thought, it seems this might be a serious issue, as two unrelated processes within an application may be scanning the same table at the same time.
>>>
>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>
>>>> I was not able to get consistent results using multiple scanners in parallel on a table. I implemented a counter test that used 8 scanners in parallel on a table with 2m rows with 2k+ columns each, and the results were not consistent. There were no errors thrown, but the count was off by as much as 2%. Using a single thread gave the same (correct) result every run.
>>>>
>>>> I tried various approaches, such as creating an HTable and opening a connection per thread, but I was not able to get stable results. I would do some testing before using parallel scanners as described here.
>>>>
>>>>
>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>
>>>>> That's part of it, the other part is to get the region demarcations.
>>>>> You can also just get the smallest and largest key of the table and pick other demarcations for your scans. Then your individual scans will likely cover multiple regions and regionservers.
>>>>>
>>>>>
>>>>> Your threading model depends on your needs. If you're interested in lowest latency you want to keep your regionservers busy for each query.
>>>>> What exactly that means depends on your setup. Maybe you split up the overall scan so that no more than N scans are active at any regionserver.
>>>>>
>>>>> If you're more interested in overall predictability, you might not want to parallelize each scan too much.
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Sam Seigal <selekt86@yahoo.com>
>>>>> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>>>>> Cc: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>> Subject: Re: Using Scans in parallel
>>>>>
>>>>> So the whole point of getting the region locations is to ensure that
>>>>> there is one thread per region server ?
>>>>>
>>>>>
>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>>
>>>>>> There were some attempts to build this in. In the end I think the exact patterns are different based on what one is trying to achieve.
>>>>>> Currently what you can do is get all the region locations (HTable.getRegionLocations). From the HRegionInfos you can get the regions' start and end keys.
>>>>>> Now you can issue parallel scans for as many regions as you want (by creating a Scan object with start and stop row set to the region's start and end key).
>>>>>> You probably want to group the regions by regionserver and have one thread per region server, or something.
>>>>>>
>>>>>>
>>>>>> -- Lars
>>>>>> ________________________________
>>>>>> From: Sam Seigal <selekt86@yahoo.com>
>>>>>> To: hbase-user@hadoop.apache.org
>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>> Subject: Using Scans in parallel
>>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> Is there a known way to be able to do Scan's in parallel (in different
>>>>>> threads even) and then sort/combine the output ?
>>>>>>
>>>>>> For a row key like:
>>>>>>
>>>>>> prefix-event_type-event_id
>>>>>> prefix-event_type-event_id
>>>>>>
>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>
>>>>>> Scan 1 =>  0-foo
>>>>>> Scan 2 =>  1-foo
>>>>>>
>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>> then merge the results ?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Sam
>>>>>>
>>>>>
>>>>
>>>
>>>
>
>
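
A minimal sketch of the range-splitting idea discussed in this thread, for anyone who wants to sanity-check their partitioning logic separately from HBase. It is not code from any poster and it does not use the HBase client: a sorted in-memory TreeMap stands in for the table, and all names (ParallelRangeCount, splitRanges, countRange, parallelCount, sampleTable) are hypothetical. The one HBase fact it relies on is that a Scan's start row is inclusive and its stop row is exclusive, so adjacent ranges that share a boundary key should count every row exactly once. It does not reproduce or explain the count mismatch reported above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Assumption: a sorted in-memory map plays the role of the table.
class ParallelRangeCount {

    // Split the key space into at most `parts` half-open ranges [start, stop).
    // A null stop means "scan to the end of the table".
    static List<String[]> splitRanges(NavigableMap<String, Integer> table, int parts) {
        List<String> keys = new ArrayList<>(table.keySet());
        List<String[]> ranges = new ArrayList<>();
        if (keys.isEmpty()) {
            return ranges;
        }
        int step = Math.max(1, (keys.size() + parts - 1) / parts);
        for (int i = 0; i < keys.size(); i += step) {
            String start = keys.get(i);
            String stop = (i + step < keys.size()) ? keys.get(i + step) : null;
            ranges.add(new String[] {start, stop});
        }
        return ranges;
    }

    // Count the rows in one range; this stands in for one scanner.
    static long countRange(NavigableMap<String, Integer> table, String start, String stop) {
        NavigableMap<String, Integer> view = (stop == null)
                ? table.tailMap(start, true)              // start inclusive, open-ended
                : table.subMap(start, true, stop, false); // start inclusive, stop exclusive
        long n = 0;
        for (String ignored : view.keySet()) {
            n++;
        }
        return n;
    }

    // Run one counting task per range on a fixed pool and sum the partial counts.
    static long parallelCount(NavigableMap<String, Integer> table, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String[] r : splitRanges(table, threads)) {
                Callable<Long> task = () -> countRange(table, r[0], r[1]);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Build a table with zero-padded keys so string order matches numeric order.
    static NavigableMap<String, Integer> sampleTable(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(String.format("row-%06d", i), i);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> table = sampleTable(2000);
        long serial = countRange(table, table.firstKey(), null);
        long parallel = parallelCount(table, 8);
        System.out.println("serial=" + serial + " parallel=" + parallel);
    }
}
```

If the serial and parallel counts agree in a harness like this but still differ against a real table, the partitioning itself is probably not the cause, and the stats Himanshu asks for (regions, region servers, timings) would be the next thing to look at.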
