From: Tony Dean
To: "user@hbase.apache.org"
Subject: RE: Scan performance
Date: Wed, 3 Jul 2013 14:59:08 +0000

Thanks Ted.

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Tuesday, July 02, 2013 6:11 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row. The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result
> is ~5ms. This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at
> the time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to
> get Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter. It's a good
> candidate since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single
> result was ~5ms (pretty much the same response time as case 1 where I
> knew the complete row key).
>
> I didn't expect such an improvement. Can you explain how
> FuzzyRowFilter optimizes scanning rows from disk? In my case it needs
> to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon". Then it can just
> stop after that; thereby optimizing the scan, correct? So,
> optimization using FuzzyRowFilter is very dependent upon the data that
> you are scanning.
>
> Thanks for any insight.
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Monday, June 24, 2013 5:05 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block
> size (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information
> for all tables could be substantial, so it does not make much sense to
> prime the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable
> and/or HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Tony Dean
> To: "user@hbase.apache.org"; lars hofhansl <larsh@apache.org>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for some time to exchange out hbase jars in cluster (that
> support FuzzyRow filter) in order to try out. In the meantime, I'm
> wondering why RowFilter regex is not more helpful. I'm guessing that
> FuzzyRow filter helps in disk io while Row filter just filters after
> the disk io has completed. Also, I turned on row level bloom filter
> which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to
> prime client connection information and such so that the first client
> query isn't miserably slow.
> After the first query, response times do
> get considerably better due to caching necessary information. Is
> there a way to get around this first initial hit? I assume any such
> priming would have to be application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but
> want to return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
>
>
> ________________________________
> From: Vladimir Rodionov
> To: "user@hbase.apache.org"; lars hofhansl <larsh@apache.org>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that column family is the locality group and placing
> columns which are frequently accessed together into the same column
> family (locality group) is the obvious performance improvement tip.
> What are the "essential column families" for in this context?
>
> As for the original question: unless you place your column into a
> separate column family in Table 2, you will need to scan (load from
> disk if not cached) ~ 40x more data for the second table (because you
> have 40 columns). This may explain why you see such a difference in
> execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV,
> a row is just a set of KVs that happen to have the same row key (which
> is the first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase
> still needs to skip over many KVs in order to "reach" the next row. So
> more disk and memory IO is needed.
>
> If you are using 0.94 there is a new feature "essential column
> families". If you always search by the same column you can place that
> one in its own column family and all other columns in another column
> family. In that case your scan performance should be close to
> identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean
> To: "user@hbase.apache.org"
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on. And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
>     // the start/stop rows should include all 6000 rows
>     Scan scan = new Scan(startRow, stopRow);
>     // only return the column that I am interested in
>     // (should only be in 1 row and only 1 version)
>     scan.addColumn(cf, qualifier);
>
>     Filter f1 = new InclusiveStopFilter(stopRow);
>     Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>         CompareFilter.CompareOp.EQUAL, value);
>     scan.setFilter(new FilterList(f1, f2));
>
>     scan.setTimeRange(0, Long.MAX_VALUE);
>     scan.setMaxVersions(1);
>
>     ResultScanner rs = t.getScanner(scan);
>     for (Result result : rs) {
>
>     }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.
> Why is it taking so much longer
> on table 2 to get the same result? The scan depth is the same. The only
> difference is the column width. But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something? As I increase the number of columns, the response
> time gets worse. I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in
> both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704 ...
>
> Confidentiality Notice: The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited. If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>
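
For anyone reading this thread later: the explanation given above (each column is its own KV, KVs are sorted in row/column order, and the scan has to step over the ones it does not want) can be illustrated with a toy model. This is only a sketch in plain Java and does not use the HBase API. The class name KvScanSketch, its methods, and the row/column naming are all made up for illustration, and a real region server reads HFile blocks and may seek rather than walking KV by KV.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the HBase API. All names here are hypothetical.
class KvScanSketch {

    // One cell. In this model, as in the thread's description, every
    // column of every row is stored as its own key/value entry.
    static final class KV {
        final String row;
        final String qualifier;

        KV(String row, String qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }
    }

    // Builds a toy "store file": KVs in row order, then column order.
    public static List<KV> buildStore(int rows, int columnsPerRow) {
        List<KV> store = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                store.add(new KV(String.format("row-%05d", r),
                                 String.format("col-%02d", c)));
            }
        }
        return store;
    }

    // Walks the store linearly (no seeking) looking for one qualifier.
    // Returns {KVs touched, KVs that matched the wanted qualifier}.
    public static long[] scan(List<KV> store, String wantedQualifier) {
        long touched = 0;
        long matched = 0;
        for (KV kv : store) {
            touched++;
            if (kv.qualifier.equals(wantedQualifier)) {
                matched++;
            }
        }
        return new long[] {touched, matched};
    }

    public static void main(String[] args) {
        long[] narrow = scan(buildStore(6000, 1), "col-00");
        long[] wide = scan(buildStore(6000, 40), "col-00");
        System.out.println("1 column/row:   touched=" + narrow[0] + " matched=" + narrow[1]);
        System.out.println("40 columns/row: touched=" + wide[0] + " matched=" + wide[1]);
    }
}
```

In this model both tables yield the same 6000 candidate cells for the wanted column, but the 40-column table makes the scan step over 240000 KVs instead of 6000, which lines up with the "~40x more data" estimate earlier in the thread. Putting the filtered column in its own column family, as suggested above, corresponds to scanning the narrow store.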