hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Scan vs Parallel scan.
Date Fri, 12 Sep 2014 07:26:42 GMT
Let's take a step back.

Your parallel scan has the client create N threads, and each thread does a partial
scan of the table, bounded by the first and last row of one region?


Is that correct? 
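
To make sure we mean the same thing, here is a minimal sketch of that pattern as I
understand it. It does not use the HBase client at all: a TreeMap stands in for the
sorted table, the class and method names are hypothetical, and the split keys are
the partition boundaries from the original post. It only illustrates one task per
key range, with an empty key meaning "unbounded", and checks that the partial scans
add up to the same result as a single scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanSketch {

    // Stand-in for the table (assumption): row key -> value, sorted by key.
    static NavigableMap<String, Long> buildTable(int rows) {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (int i = 1; i <= rows; i++) {
            table.put(String.format("%06d", i), (long) i);
        }
        return table;
    }

    // Single scan: one pass over every row, summing the values.
    static long singleScan(NavigableMap<String, Long> table) {
        long sum = 0;
        for (long v : table.values()) sum += v;
        return sum;
    }

    // Partial scan over [startKey, stopKey); an empty key means "no bound".
    static long partialScan(NavigableMap<String, Long> table, String startKey, String stopKey) {
        NavigableMap<String, Long> range = table;
        if (!startKey.isEmpty()) range = range.tailMap(startKey, true);
        if (!stopKey.isEmpty()) range = range.headMap(stopKey, false);
        long sum = 0;
        for (long v : range.values()) sum += v;
        return sum;
    }

    // One task per key range; the ranges are derived from the split keys.
    static long parallelScan(NavigableMap<String, Long> table, String[] splits) {
        ExecutorService pool = Executors.newFixedThreadPool(splits.length + 1);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i <= splits.length; i++) {
                final String start = (i == 0) ? "" : splits[i - 1];
                final String stop = (i == splits.length) ? "" : splits[i];
                futures.add(pool.submit(() -> partialScan(table, start, stop)));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = buildTable(100000);
        String[] splits = {"016666", "033332", "049998", "066664", "083330"};
        System.out.println("single:   " + singleScan(table));
        System.out.println("parallel: " + parallelScan(table, splits));
    }
}
```

If each thread were given an unbounded range by mistake, the parallel total would
come out as a multiple of the single-scan total, which is an easy thing to check.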

On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:

> I looked into this a little more. I checked the cluster, and the data is
> stored on three different region servers, each one on a different node.
> So I guess the threads go to different hard disks.
> 
> Does anyone have an idea or suggestion as to why a single scan is faster than
> this implementation? I based it on this implementation:
> https://github.com/zygm0nt/hbase-distributed-search
> 
> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> 
>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although I
>> don't think it will make a difference.
>> I disabled the table, disabled the block cache for that family, and also set
>> scan.setCacheBlocks(false) in both cases.
>> 
>> I don't think each thread can be executing a complete scan, since my data
>> looks like this:
>> 000001 f:q value=1
>> 000002 f:q value=2
>> 000003 f:q value=3
>> ...
>> 
>> I sum all the values and get the same result from the single scan as from the
>> distributed one, so I guess the DistributedScan works correctly.
>> A count from the hbase shell takes about 10-15 seconds (I don't remember
>> exactly), roughly 4x the scan time.
>> I'm not using any filters for the scans.
>> 
>> This is how I calculate the number of regions/scans:
>> private List<RegionScanner> generatePartitions() {
>>        List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>        byte[] startKey;
>>        byte[] stopKey;
>>        HConnection connection = null;
>>        HBaseAdmin hbaseAdmin = null;
>>        try {
>>            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>            hbaseAdmin = new HBaseAdmin(connection);
>>            List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>            RegionScanner regionScanner = null;
>>            for (HRegionInfo region : regions) {
>> 
>>                startKey = region.getStartKey();
>>                stopKey = region.getEndKey();
>> 
>>                regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>                // regionScanner = createRegionScanner(startKey, stopKey);
>>                if (regionScanner != null) {
>>                    regionScanners.add(regionScanner);
>>                }
>>            }
>> 
>> I ran some tests on a tiny table, and I think the range for each scan
>> works fine. Still, I thought it was interesting that the distributed scan
>> takes about 6x as long.
>> 
>> I'm going to check the hard disks, but I think it's right.
>> 
>> 
>> 
>> 
>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org>:
>> 
>>> Which version of HBase?
>>> Can you show us the code?
>>> 
>>> 
>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>> scan, which is suspicious because you say you have 6 regions.
>>> Are you sure you're not accidentally scanning all the data in each of
>>> your parallel scans?
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Guillermo Ortiz <konstt2000@gmail.com>
>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>> Subject: Scan vs Parallel scan.
>>> 
>>> 
>>> Hi,
>>> 
>>> I developed a distributed scan that creates a thread for each region. After
>>> that, I tried to compare timings for Scan vs. DistributedScan.
>>> I have disabled the block cache on my table. My cluster has 3 region servers
>>> with 2 regions each; in total there are 100,000 rows, and I execute a
>>> complete scan.
>>> 
>>> My partitions are
>>> -016666 -> request 16665
>>> 016666-033332 -> request 16666
>>> 033332-049998 -> request 16666
>>> 049998-066664 -> request 16666
>>> 066664-083330 -> request 16666
>>> 083330- -> request 16671
>>> 
>>> 
>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
>>> Caching 10
>>> 
>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 ->
>>> Caching 100
>>> 
>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
>>> Caching 1000
>>> 
>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
>>> Caching 1
>>> 
>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
>>> Caching 100
>>> 
>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
>>> Caching 1000
>>> 
>>> The parallel scan performs much worse than the simple scan, and I don't know
>>> why the simple scan is so fast. It's really much faster than executing a
>>> "count" from the hbase shell, which doesn't look normal. The only time the
>>> parallel scan does better is against a normal scan with caching 1.
>>> 
>>> Any clue about it?
>>> 
>> 
>> 
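
As a quick sanity check on the partition list quoted above, the sketch below
recomputes the rows per range and a rough count of scanner round trips for a given
caching value. It assumes the row keys are the zero-padded strings 000001 through
100000 (inferred from the sample rows and the NUM ROWS line), and that a scanner
fetches about `caching` rows per call; scanner open/close calls are ignored. The
class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

public class PartitionCheck {

    // Count the keys that fall in each [start, stop) range defined by the splits.
    static long[] rowsPerPartition(int rows, String[] splits) {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 1; i <= rows; i++) {
            keys.add(String.format("%06d", i)); // assumed key format
        }
        long[] counts = new long[splits.length + 1];
        for (int i = 0; i <= splits.length; i++) {
            NavigableSet<String> range = keys;
            if (i > 0) range = range.tailSet(splits[i - 1], true);
            if (i < splits.length) range = range.headSet(splits[i], false);
            counts[i] = range.size();
        }
        return counts;
    }

    // Rough estimate: each scanner needs ceil(rows / caching) fetch calls.
    static long roundTrips(long[] counts, int caching) {
        long total = 0;
        for (long c : counts) {
            total += (c + caching - 1) / caching;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] splits = {"016666", "033332", "049998", "066664", "083330"};
        long[] counts = rowsPerPartition(100000, splits);
        System.out.println("rows per partition: " + Arrays.toString(counts));
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching " + caching + ": ~"
                    + roundTrips(counts, caching) + " fetch calls in total");
        }
    }
}
```

Under those assumptions the ranges are disjoint and add up to 100,000 rows, and the
partitioned scan needs about as many fetch calls as a single scan with the same
caching, so neither overlapping ranges nor the caching value alone would explain a
6x slowdown.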

