hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Scan vs Parallel scan.
Date Fri, 12 Sep 2014 07:40:48 GMT
Hi, 

I wanted to take a step back from the actual code and stop and think about what you are
doing and what HBase is doing under the covers. 

So in your code, you are asking HBase to do several separate scans, one per region, and then
you take the result sets back and join them. 

What does HBase do when it does a range scan? 
What happens when that range scan exceeds a single region? 

If you answer those questions… you’ll have your answer. 
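
To make that concrete (a sketch of mine, not code from the thread, using the same 0.98-era
client API that appears below): a single client-side Scan already handles a range that spans
regions. The client opens a scanner on the first region overlapping [startRow, stopRow),
streams its rows, and when that region's end key is reached it transparently re-opens on the
next region, one region at a time, in key order.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;

    public class SingleScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HConnection connection = HConnectionManager.createConnection(conf);
            // "mytable" is a placeholder table name.
            HTableInterface table = connection.getTable("mytable");
            // No start/stop row: one whole-table range scan.
            ResultScanner scanner = table.getScanner(new Scan());
            int rows = 0;
            for (Result result : scanner) {
                rows++; // rows arrive in key order, region after region
            }
            System.out.println(rows + " rows");
            scanner.close();
            table.close();
            connection.close();
        }
    }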

HTH

-Mike

On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:

> It's not all the code; I set things like these as well:
> scan.setMaxVersions();
> scan.setCacheBlocks(false);
> ...
> 
> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> 
>> Yes, that's it. I have changed the HBase version to 0.98.
>> 
>> I got the start and stop keys with this method:
>> private List<RegionScanner> generatePartitions() {
>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>     byte[] startKey;
>>     byte[] stopKey;
>>     HConnection connection = null;
>>     HBaseAdmin hbaseAdmin = null;
>>     try {
>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>         hbaseAdmin = new HBaseAdmin(connection);
>>         // One scan range per region of the table.
>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>         RegionScanner regionScanner = null;
>>         for (HRegionInfo region : regions) {
>>             startKey = region.getStartKey();
>>             stopKey = region.getEndKey();
>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>             if (regionScanner != null) {
>>                 regionScanners.add(regionScanner);
>>             }
>>         }
>>     // The catch/finally (closing hbaseAdmin and connection) and the
>>     // return of regionScanners are cut off here in the original mail.
>> 
>> And I execute the RegionScanner with this:
>> public List<Result> call() throws Exception {
>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>     HTableInterface table = connection.getTable(configuration.getTable());
>> 
>>     Scan scan = new Scan(startKey, stopKey);
>>     scan.setBatch(configuration.getBatch());
>>     scan.setCaching(configuration.getCaching());
>>     ResultScanner resultScanner = table.getScanner(scan);
>> 
>>     List<Result> results = new ArrayList<Result>();
>>     for (Result result : resultScanner) {
>>         results.add(result);
>>     }
>> 
>>     // Close innermost-first; the original mail closed the connection
>>     // before the table and never closed the scanner.
>>     resultScanner.close();
>>     table.close();
>>     connection.close();
>> 
>>     return results;
>> }
>> 
>> They implement Callable.
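>> 
>> (Not shown in the mail: one way the callables might be driven. This is a
>> sketch under the assumption that RegionScanner implements
>> Callable<List<Result>>, as described above; the executor wiring is mine.
>> It needs java.util.concurrent.* and org.apache.hadoop.hbase.client.Result.)
>> 
>>     // Fan out one RegionScanner per region, then merge the partial results.
>>     List<Result> runParallel(List<RegionScanner> regionScanners) throws Exception {
>>         ExecutorService pool = Executors.newFixedThreadPool(regionScanners.size());
>>         List<Result> merged = new ArrayList<Result>();
>>         try {
>>             for (Future<List<Result>> future : pool.invokeAll(regionScanners)) {
>>                 merged.addAll(future.get()); // blocks until that region's scan finishes
>>             }
>>         } finally {
>>             pool.shutdown();
>>         }
>>         return merged;
>>     }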
>> 
>> 
>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> 
>>> Let's take a step back…
>>> 
>>> Your parallel scan is having the client create N threads where in each
>>> thread, you’re doing a partial scan of the table where each partial scan
>>> takes the first and last row of each region?
>>> 
>>> Is that correct?
>>> 
>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com>
>>> wrote:
>>> 
>>>> I was checking a little bit more about it. I checked the cluster, and the data
>>>> is stored in three different region servers, each one on a different node.
>>>> So, I guess the threads go to different hard disks.
>>>> 
>>>> If someone has an idea or suggestion about why a single scan is faster
>>>> than this implementation... I based it on this implementation:
>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>> 
>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
>>>> 
>>>>> I'm working with HBase 0.94 in this case. I'll try with 0.98, although
>>>>> there is no difference.
>>>>> I disabled the table and disabled the block cache for that family, and I
>>>>> set scan.setCacheBlocks(false) as well in both cases.
>>>>> 
>>>>> I think it's not possible that I'm executing a complete scan in each
>>>>> thread, since my data are of the type:
>>>>> 000001 f:q value=1
>>>>> 000002 f:q value=2
>>>>> 000003 f:q value=3
>>>>> ...
>>>>> 
>>>>> I add up all the values and get the same result with a single scan as with a
>>>>> distributed one, so I guess that the DistributedScan did well.
>>>>> The count from the hbase shell takes about 10-15 seconds, I don't remember
>>>>> exactly, but something like 4x the scan time.
>>>>> I'm not using any filter for the scans.
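>>>>> 
>>>>> (Not in the original mail: a hypothetical version of that check, assuming
>>>>> the f:q values are stored as printable strings, as in the sample rows above.)
>>>>> 
>>>>>     // Sum the f:q values of a scan's results; the single and the
>>>>>     // distributed scan should yield the same total.
>>>>>     long sum = 0;
>>>>>     for (Result result : results) {
>>>>>         byte[] value = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
>>>>>         sum += Long.parseLong(Bytes.toString(value));
>>>>>     }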
>>>>> 
>>>>> This is the way I calculate the number of regions/scans:
>>>>> private List<RegionScanner> generatePartitions() {
>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>     byte[] startKey;
>>>>>     byte[] stopKey;
>>>>>     HConnection connection = null;
>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>     try {
>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>         RegionScanner regionScanner = null;
>>>>>         for (HRegionInfo region : regions) {
>>>>>             startKey = region.getStartKey();
>>>>>             stopKey = region.getEndKey();
>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>             if (regionScanner != null) {
>>>>>                 regionScanners.add(regionScanner);
>>>>>             }
>>>>>         }
>>>>> 
>>>>> I did some tests on a tiny table and I think the range for each scan
>>>>> works fine. Although, I thought it was interesting that the time when I
>>>>> execute the distributed scan is about 6x.
>>>>> 
>>>>> I'm going to check the hard disks, but I think they're fine.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org>:
>>>>> 
>>>>>> Which version of HBase?
>>>>>> Can you show us the code?
>>>>>> 
>>>>>> 
>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>>>>> scan, which is suspicious because you say you have 6 regions.
>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>> your parallel scans?
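>>>>>> 
>>>>>> (Not in the original mail: a quick sanity check for that, using the same
>>>>>> HBaseAdmin API as the code elsewhere in this thread; "hbaseAdmin" and
>>>>>> "tableName" are assumed to be in scope.)
>>>>>> 
>>>>>>     // Print each region's [startKey, endKey) so the per-thread
>>>>>>     // scan ranges can be checked for gaps or overlap.
>>>>>>     for (HRegionInfo region : hbaseAdmin.getTableRegions(tableName)) {
>>>>>>         System.out.println(Bytes.toStringBinary(region.getStartKey())
>>>>>>                 + " - " + Bytes.toStringBinary(region.getEndKey()));
>>>>>>     }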
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Guillermo Ortiz <konstt2000@gmail.com>
>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>> Subject: Scan vs Parallel scan.
>>>>>> 
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I developed a distributed scan: I create a thread for each region. After
>>>>>> that, I tried to get some timings, Scan vs. DistributedScan.
>>>>>> I have disabled the block cache on my table. My cluster has 3 region
>>>>>> servers with 2 regions each; in total there are 100,000 rows, and I execute a
>>>>>> complete scan.
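>>>>>> 
>>>>>> (The TimerUtil in the logs below isn't shown anywhere in the thread; a
>>>>>> minimal sketch of the kind of timing it implies, where runScan() is a
>>>>>> placeholder for either the single or the distributed variant.)
>>>>>> 
>>>>>>     long start = System.currentTimeMillis();
>>>>>>     List<Result> rows = runScan(); // single Scan or DistributedScan
>>>>>>     long elapsed = System.currentTimeMillis() - start;
>>>>>>     System.out.println("NUM ROWS " + rows.size() + " in " + elapsed + "ms");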
>>>>>> 
>>>>>> My partitions are:
>>>>>>       -016666 -> request 16665
>>>>>> 016666-033332 -> request 16666
>>>>>> 033332-049998 -> request 16666
>>>>>> 049998-066664 -> request 16666
>>>>>> 066664-083330 -> request 16666
>>>>>> 083330-       -> request 16671
>>>>>> 
>>>>>> 
>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>> 
>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>> 
>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>> 
>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>> 
>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>> 
>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>> 
>>>>>> The parallel scan works much worse than the simple scan, and I don't know
>>>>>> why the simple scan is so fast; it's really much faster than executing a
>>>>>> "count" from the hbase shell, which doesn't look normal. The only time the
>>>>>> parallel version works better is when I compare it against a normal scan
>>>>>> with caching 1.
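>>>>>> 
>>>>>> (A note not in the original mail: Scan.setCaching(n) sets how many rows
>>>>>> each scanner RPC returns, so caching 1 means one round trip per row,
>>>>>> which is why that normal scan is by far the slowest run above.)
>>>>>> 
>>>>>>     scan.setCaching(500); // e.g. fetch 500 rows per RPC instead of 1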
>>>>>> 
>>>>>> Any clue about it?
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 

