From: Tony Dean
To: "user@hbase.apache.org"
Subject: RE: Scan performance
Date: Wed, 17 Jul 2013 01:29:26 +0000

I was able to test scan performance with 0.94.9 with around 6000 rows X 40
columns, and FuzzyRowFilter gave us 2-4 times better performance. I was able
to test this offline without any problems. However, once I turned it on in
our development cluster, we noticed that some row keys that should have
matched were not matching. After reverting to SingleColumnValueFilter, the
cases that were failing began to work again. We thought that the anomaly
was due to certain data in the row key, but we managed to create identical
row keys in a different table and saw the scan work. So, bottom line, I
can't explain this behavior. Has anyone seen this behavior, and does anyone
have debugging tips?

Thanks.

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Tuesday, July 02, 2013 6:11 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row. The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result
> is ~5ms. This is expected.
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at
> the time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to
> get Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter. It's a good
> candidate since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single
> result was ~5ms (pretty much the same response time as case 1 where I
> knew the complete row key).
>
> I didn't expect such an improvement. Can you explain how
> FuzzyRowFilter optimizes scanning rows from disk? In my case it needs
> to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon". Then it can just
> stop after that; thereby optimizing the scan, correct? So,
> optimization using FuzzyRowFilter is very dependent upon the data that
> you are scanning.
>
> Thanks for any insight.
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Monday, June 24, 2013 5:05 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block
> size (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information
> for all tables could be substantial, so it does not make much sense to
> prime the cache for all tables.
> How are you using the client? If you pre-create and reuse HTable and/or
> HConnection you should be OK.
>
> -- Lars
>
>
> ________________________________
> From: Tony Dean
> To: "user@hbase.apache.org"; lars hofhansl <larsh@apache.org>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for some time to exchange out hbase jars in the cluster (that
> support FuzzyRowFilter) in order to try it out. In the meantime, I'm
> wondering why RowFilter regex is not more helpful. I'm guessing that
> FuzzyRowFilter helps in disk io while RowFilter just filters after
> the disk io has completed. Also, I turned on row level bloom filter
> which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to
> prime client connection information and such so that the first client
> query isn't miserably slow. After the first query, response times do
> get considerably better due to caching necessary information. Is
> there a way to get around this first initial hit? I assume any such
> priming would have to be application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but
> want to return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
>
> ________________________________
> From: Vladimir Rodionov
> To: "user@hbase.apache.org"; lars hofhansl <larsh@apache.org>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that column family is the locality group and placing
> columns which are frequently accessed together into the same column
> family (locality group) is the obvious performance improvement tip.
> What are the "essential column families" for in this context?
>
> As for the original question..
> Unless you place your column into a
> separate column family in Table 2, you will need to scan (load from
> disk if not cached) ~ 40x more data for the second table (because you
> have 40 columns). This may explain why you see such a difference in
> execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV;
> a row is just a set of KVs that happen to have the same row key (which is
> the first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase
> still needs to skip over many KVs in order to "reach" the next row. So
> more disk and memory IO is needed.
>
> If you are using 0.94 there is a new feature "essential column families".
> If you always search by the same column you can place that one in its
> own column family and all other columns in another column family. In
> that case your scan performance should be close to identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean
> To: "user@hbase.apache.org"
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.
> And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> // the start/stop rows should include all 6000 rows
> Scan scan = new Scan(startRow, stopRow);
> // only return the column that I am interested in
> // (should only be in 1 row and only 1 version)
> scan.addColumn(cf, qualifier);
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs) {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result. Why is it taking so much longer
> on table 2 to get the same result? The scan depth is the same. The only
> difference is the column width. But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something? As I increase the number of columns, the response
> time gets worse. I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in
> both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704 ...
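
On the debugging question at the top of this message, one thing that may be
worth checking is whether position2 always has the same length. As far as the
FuzzyRowFilter documentation describes it, the fuzzy info is a byte mask
applied position by position (0 = byte must match, 1 = any byte), so a pattern
built for a 5-byte position2 would not line up with a row whose position2 is
longer. The sketch below is a simplified, HBase-free model of that positional
matching, not the filter's actual implementation. The class FuzzyMaskSketch,
its methods, and the '?' template notation are hypothetical names made up for
illustration, and the row key vid,sid-12,Logon is an invented example. Whether
this explains the mismatches seen on the development cluster cannot be
determined from the thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: models byte-positional fuzzy matching
// without depending on HBase. It is not HBase's FuzzyRowFilter code.
public class FuzzyMaskSketch {

    // mask[i] == 0: row[i] must equal pattern[i]; mask[i] == 1: any byte is accepted.
    // Sketch assumptions: a row shorter than the pattern never matches, and
    // row bytes beyond the pattern length are ignored.
    public static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Builds a mask with 1 wherever the template has '?', and 0 elsewhere.
    public static byte[] maskFor(String template) {
        byte[] bytes = template.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            mask[i] = (byte) (bytes[i] == '?' ? 1 : 0);
        }
        return mask;
    }

    // Convenience overload: treats '?' in the template as a one-byte wildcard.
    public static boolean matches(String row, String template) {
        return matches(
            row.getBytes(StandardCharsets.US_ASCII),
            template.getBytes(StandardCharsets.US_ASCII),
            maskFor(template));
    }

    public static void main(String[] args) {
        // Template assumes position2 is exactly 5 bytes long.
        String template = "vid,?????,Logon";
        System.out.println(matches("vid,sid-0,Logon", template));  // prints true
        System.out.println(matches("vid,sid-12,Logon", template)); // prints false
    }
}
```

In this model the second key fails because its extra byte shifts the comma and
"Logon" one position to the right of where the fixed bytes are expected. If
position2 can vary in length in the real table, comparing the byte lengths of
the failing and working row keys would be a cheap first check.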