Subject: Re: Scan vs Parallel scan.
From: Michael Segel <michael_segel@hotmail.com>
Date: Fri, 12 Sep 2014 14:23:42 +0100
To: user@hbase.apache.org

It doesn't matter which RS, but that you have 1 thread for each region.

So for each thread, what's happening?
Step by step, what is the code doing?

Now you're comparing this against a single table scan, right?
What's happening in the table scan…?

On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz wrote:

> Right. My table, for example, has keys between 0-9, in three regions:
> 0-2, 3-7, 7-9.
> I launch three partial scans in parallel. The scans that I'm executing are:
> scan(0,2), scan(3,7), scan(7,9).
> Each region is in a different RS, so each thread goes to a different RS.
> It's not exactly like that, but in the benchmark case that's how it's working.
>
> Really the code will execute a thread for each region, not for each
> RegionServer. But in the test I only have two regions per RegionServer.
> I don't think that's an important point; there are two threads per RS.
>
> 2014-09-12 14:48 GMT+02:00 Michael Segel:
>
>> Ok, let's again take a step back…
>>
>> So you are comparing your partial scan(s) against a full table scan?
>>
>> If I understood your question, you launch 3 partial scans where you set
>> the start row and the end row of each scan, right?
>>
>> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz wrote:
>>
>>> Okay, then the partial scan doesn't work the way I think.
>>> How could it exceed the limit of a single region if I calculate the
>>> limits?
>>>
>>> The only bad point that I see is that if a region server holds three
>>> regions of the same table, I'm executing three partial scans against
>>> that RS and they could compete for resources (network, etc.) on that
>>> node. It'd be better to have one thread per RS. But that doesn't
>>> answer your questions.
>>>
>>> I keep thinking...
>>>
>>> 2014-09-12 9:40 GMT+02:00 Michael Segel:
>>>
>>>> Hi,
>>>>
>>>> I wanted to take a step back from the actual code and to stop and think
>>>> about what you are doing and what HBase is doing under the covers.
>>>>
>>>> So in your code, you are asking HBase to do 3 separate scans and then
>>>> you take the result sets back and join them.
>>>>
>>>> What does HBase do when it does a range scan?
>>>> What happens when that range scan exceeds a single region?
>>>>
>>>> If you answer those questions… you'll have your answer.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz wrote:
>>>>
>>>>> It's not all the code; I set things like these as well:
>>>>> scan.setMaxVersions();
>>>>> scan.setCacheBlocks(false);
>>>>> ...
>>>>>
>>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz:
>>>>>
>>>>>> Yes, that's right. I have changed the HBase version to 0.98.
>>>>>>
>>>>>> I got the start and stop keys with this method:
>>>>>>
>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>     byte[] startKey;
>>>>>>     byte[] stopKey;
>>>>>>     HConnection connection = null;
>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>     try {
>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>         List<HRegionInfo> regions =
>>>>>>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>         RegionScanner regionScanner = null;
>>>>>>         for (HRegionInfo region : regions) {
>>>>>>             startKey = region.getStartKey();
>>>>>>             stopKey = region.getEndKey();
>>>>>>
>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>             if (regionScanner != null) {
>>>>>>                 regionScanners.add(regionScanner);
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>> And I execute the RegionScanner with this:
>>>>>>
>>>>>> public List<Result> call() throws Exception {
>>>>>>     HConnection connection =
>>>>>>         HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>>>
>>>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>>>     scan.setBatch(configuration.getBatch());
>>>>>>     scan.setCaching(configuration.getCaching());
>>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>>>
>>>>>>     List<Result> results = new ArrayList<Result>();
>>>>>>     for (Result result : resultScanner) {
>>>>>>         results.add(result);
>>>>>>     }
>>>>>>
>>>>>>     connection.close();
>>>>>>     table.close();
>>>>>>
>>>>>>     return results;
>>>>>> }
>>>>>>
>>>>>> They implement Callable.
>>>>>>
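A minimal sketch of how the per-region RegionScanner Callables described above could be driven: one task per region, submitted to a fixed thread pool, with the partial results joined afterwards. RegionScanner and ScanConfiguration are the poster's own classes (not shown here); the driver class name, the pool size, and the single shared admin connection are illustrative assumptions, not the poster's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.Result;

// Hypothetical driver class: one Callable per region, all submitted to a shared pool.
public class ParallelScanDriver {

    public List<Result> scanAllRegions(ScanConfiguration scanConfiguration) throws Exception {
        HConnection connection =
                HConnectionManager.createConnection(HBaseConfiguration.create());
        HBaseAdmin hbaseAdmin = new HBaseAdmin(connection);
        ExecutorService pool = Executors.newFixedThreadPool(6); // e.g. one thread per region
        try {
            List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
            for (HRegionInfo region : hbaseAdmin.getTableRegions(scanConfiguration.getTable())) {
                // Each RegionScanner (the poster's Callable) covers exactly one region's key range.
                futures.add(pool.submit(
                        new RegionScanner(region.getStartKey(), region.getEndKey(), scanConfiguration)));
            }
            List<Result> all = new ArrayList<Result>();
            for (Future<List<Result>> future : futures) {
                all.addAll(future.get()); // join the partial result sets
            }
            return all;
        } finally {
            pool.shutdown();
            hbaseAdmin.close();
            connection.close();
        }
    }
}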
>>>>>>
>>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel:
>>>>>>
>>>>>>> Let's take a step back…
>>>>>>>
>>>>>>> Your parallel scan is having the client create N threads, where in
>>>>>>> each thread you're doing a partial scan of the table, and each partial
>>>>>>> scan takes the first and last row of one region?
>>>>>>>
>>>>>>> Is that correct?
>>>>>>>
>>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz wrote:
>>>>>>>
>>>>>>>> I was checking a little bit more. I checked the cluster, and the data
>>>>>>>> is stored in three different region servers, each one on a different
>>>>>>>> node. So, I guess the threads go to different hard disks.
>>>>>>>>
>>>>>>>> If someone has an idea or suggestion why a single scan is faster than
>>>>>>>> this implementation... I based it on this implementation:
>>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>>>
>>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz:
>>>>>>>>
>>>>>>>>> I'm working with HBase 0.94 for this case; I'll try with 0.98,
>>>>>>>>> although there is no difference.
>>>>>>>>> I disabled the table and disabled the block cache for that family,
>>>>>>>>> and I set scan.setCacheBlocks(false) as well for both cases.
>>>>>>>>>
>>>>>>>>> I think it's not possible that I'm executing a complete scan for
>>>>>>>>> each thread, since my data are of the type:
>>>>>>>>> 000001 f:q value=1
>>>>>>>>> 000002 f:q value=2
>>>>>>>>> 000003 f:q value=3
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> I add all the values and get the same result with a single scan as
>>>>>>>>> with a distributed one, so I guess the DistributedScan did well.
>>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>>>>>>>>> remember exactly, but something like 4x the scan time.
>>>>>>>>> I'm not using any filter for the scans.
>>>>>>>>>
>>>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>>>
>>>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>>>     byte[] startKey;
>>>>>>>>>     byte[] stopKey;
>>>>>>>>>     HConnection connection = null;
>>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>>>     try {
>>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>>>         List<HRegionInfo> regions =
>>>>>>>>>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>>>             startKey = region.getStartKey();
>>>>>>>>>             stopKey = region.getEndKey();
>>>>>>>>>
>>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>>>             if (regionScanner != null) {
>>>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> I did some tests on a tiny table and I think the range for each scan
>>>>>>>>> works fine. Still, I thought it was interesting that the time when I
>>>>>>>>> execute the distributed scan is about 6x.
>>>>>>>>>
>>>>>>>>> I'm going to check the hard disks, but I think they're fine.
>>>>>>>>>
>>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl:
>>>>>>>>>
>>>>>>>>>> Which version of HBase?
>>>>>>>>>> Can you show us the code?
>>>>>>>>>>
>>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the
>>>>>>>>>> single scan, which is suspicious because you say you have 6 regions.
>>>>>>>>>> Are you sure you're not accidentally scanning all the data in each
>>>>>>>>>> of your parallel scans?
>>>>>>>>>>
>>>>>>>>>> -- Lars
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Guillermo Ortiz
>>>>>>>>>> To: "user@hbase.apache.org"
>>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I developed a distributed scan; I create a thread for each region.
>>>>>>>>>> After that, I've tried to time Scan vs DistributedScan.
>>>>>>>>>> I have disabled the block cache in my table. My cluster has 3 region
>>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows and I
>>>>>>>>>> execute a complete scan.
>>>>>>>>>>
>>>>>>>>>> My partitions are:
>>>>>>>>>>       -016666 -> request 16665
>>>>>>>>>> 016666-033332 -> request 16666
>>>>>>>>>> 033332-049998 -> request 16666
>>>>>>>>>> 049998-066664 -> request 16666
>>>>>>>>>> 066664-083330 -> request 16666
>>>>>>>>>> 083330-       -> request 16671
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>>>>>>
>>>>>>>>>> The parallel scan works much worse than the simple scan, and I don't
>>>>>>>>>> know why the simple scan is so fast; it's really much faster than
>>>>>>>>>> executing a "count" from the hbase shell, which doesn't look quite
>>>>>>>>>> normal. The only time the parallel version works better is against a
>>>>>>>>>> normal scan with caching 1.
>>>>>>>>>>
>>>>>>>>>> Any clue about it?
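For comparison, a minimal sketch of the single full-table "normal" scan used as the baseline in the numbers above, with caching raised and the block cache disabled as described in the thread. The class name and the table name "t" are illustrative assumptions; the poster's actual benchmark code is not shown in the thread.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class SingleScanBaseline {
    public static void main(String[] args) throws Exception {
        HConnection connection =
                HConnectionManager.createConnection(HBaseConfiguration.create());
        HTableInterface table = connection.getTable("t"); // assumed table name
        try {
            Scan scan = new Scan();       // full table: no start/stop row set
            scan.setCaching(1000);        // rows fetched per RPC (the "Caching" value in the logs)
            scan.setCacheBlocks(false);   // keep the block cache out of the comparison
            ResultScanner scanner = table.getScanner(scan);
            long rows = 0;
            for (Result result : scanner) {
                rows++;                   // count rows, as in the NUM ROWS log lines
            }
            scanner.close();
            System.out.println("NUM ROWS " + rows);
        } finally {
            table.close();
            connection.close();
        }
    }
}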