From: Tony Dean <Tony.Dean@sas.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Subject: RE: Scan performance
Date: Wed, 17 Jul 2013 01:29:26 +0000

I was able to test scan performance on 0.94.9 with around 6000 rows x 40
columns, and FuzzyRowFilter gave us 2-4 times better performance.  I was
able to test this offline without any problems.  However, once I turned it
on in our development cluster, we noticed that some row keys that should
have matched were not matching.  After reverting to
SingleColumnValueFilter, the cases that were failing began to work again.
We thought that the anomaly was due to certain data in the row key, but we
managed to create identical row keys in a different table and saw the scan
work.  So, bottom line, I can't explain this behavior.  Has anyone seen
this behavior, and does anyone have debugging tips?

Thanks.
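
One thing that may be worth ruling out when debugging: FuzzyRowFilter compares row keys position by position against a pattern and a mask (0 = this byte must match, 1 = any byte is accepted), so it assumes the wildcard segment has a fixed width. The sketch below is plain Java, not HBase code; the class name, the helper names, and the assumption that position2 is exactly five bytes wide are all hypothetical. It only illustrates how a longer position2 shifts the later bytes out of alignment with the pattern. Whether this applies to the failing keys in the development cluster is an open question.

```java
import java.nio.charset.StandardCharsets;

/** Pure-Java sketch of positional (byte-by-byte) fuzzy matching; not HBase code. */
class FuzzyMatchSketch {

    /**
     * Returns true when every byte flagged 0 in the mask equals the
     * corresponding pattern byte. Bytes flagged 1 may hold any value.
     * Row keys shorter than the pattern never match in this sketch.
     */
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Pattern "vid,?????,Logon": the five '?' bytes are wildcards (mask 1).
        byte[] pattern = ascii("vid,?????,Logon");
        byte[] mask = new byte[pattern.length];
        for (int i = 4; i < 9; i++) {
            mask[i] = 1;
        }
        // position2 is 5 bytes wide, so the trailing ",Logon" lines up: prints true
        System.out.println(matches(ascii("vid,sid-0,Logon"), pattern, mask));
        // position2 is 6 bytes wide, so the comma is shifted by one: prints false
        System.out.println(matches(ascii("vid,sid-10,Logon"), pattern, mask));
    }
}
```

If the keys in the two tables differ only in the width of position2, comparing the raw key bytes against the mask in this way is a cheap first check.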

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Tuesday, July 02, 2013 6:11 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <Tony.Dean@sas.com> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row.  The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon    event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result
> is ~5ms.  This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at
> the time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to
> get Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter.  It's a good
> candidate since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single
> result was ~5ms (pretty much the same response time as Case 1, where I
> knew the complete row key).
>
> I didn't expect such an improvement.  Can you explain how
> FuzzyRowFilter optimizes scanning rows from disk?  In my case it needs
> to scan rows (vid,?,xxxx) until xxxx is greater than "Logon".  Then it
> can just stop after that, thereby optimizing the scan, correct?  So,
> optimization using FuzzyRowFilter is very dependent upon the data that
> you are scanning.
>
> Thanks for any insight.
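
On the question of how the optimization works: FuzzyRowFilter does not stop the scan after "Logon"; it hands the scanner a seek hint so the scan can jump forward to the next row key that could still match, instead of walking every row in between. The toy model below approximates that idea over a sorted in-memory set. It is plain Java, not the HBase implementation, and the class name, the sample events, and the fixed five-character width of position2 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of hint-based seeking over sorted, fixed-width row keys; not HBase code. */
class SeekHintSketch {
    static final String PREFIX = "vid,";     // known position1
    static final int WILD = 5;               // assumed fixed width of the unknown position2
    static final String SUFFIX = ",Logon";   // known position3

    static int visited;                      // rows actually examined by the last scan

    /** Assumes every key under PREFIX is at least PREFIX.length() + WILD characters long. */
    static List<String> scanWithHints(TreeSet<String> rows) {
        List<String> hits = new ArrayList<>();
        visited = 0;
        String cur = rows.ceiling(PREFIX);
        while (cur != null && cur.startsWith(PREFIX)) {
            visited++;
            String sid = cur.substring(PREFIX.length(), PREFIX.length() + WILD);
            String target = PREFIX + sid + SUFFIX;
            int cmp = cur.compareTo(target);
            if (cmp < 0) {
                // Before the only key that could match for this sid: jump to it.
                cur = rows.ceiling(target);
            } else {
                if (cmp == 0) {
                    hits.add(cur);
                }
                // At or past the target: nothing else under this sid can match,
                // so jump past every remaining key that shares the sid.
                cur = rows.ceiling(PREFIX + sid + '\uffff');
            }
        }
        return hits;
    }

    /** 100 sids with four non-matching events each, plus one matching row. */
    static TreeSet<String> sampleRows() {
        TreeSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            String sid = String.format("s%04d", i);
            for (String event : new String[] {"Click", "Logoff", "Purchase", "View"}) {
                rows.add(PREFIX + sid + "," + event);
            }
        }
        rows.add(PREFIX + "s0042" + SUFFIX);
        return rows;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = sampleRows();
        List<String> hits = scanWithHints(rows);
        System.out.println(hits + " found after examining " + visited
                + " of " + rows.size() + " rows");
    }
}
```

In this model the saving grows with the number of rows that share each position2 value, which is consistent with the observation that the benefit depends on the data being scanned.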
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Monday, June 24, 2013 5:05 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block
> size (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information
> for all tables could be substantial, so it does not make much sense to
> prime the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable
> and/or HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Tony Dean <Tony.Dean@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <larsh@apache.org>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for some time to swap out the hbase jars in the cluster
> (for ones that support FuzzyRowFilter) in order to try it out.  In the
> meantime, I'm wondering why the RowFilter regex is not more helpful.
> I'm guessing that FuzzyRowFilter helps with disk io while RowFilter
> just filters after the disk io has completed.  Also, I turned on the
> row-level bloom filter, which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to
> prime client connection information and such so that the first client
> query isn't miserably slow.  After the first query, response times do
> get considerably better due to caching necessary information.  Is
> there a way to get around this first initial hit?  I assume any such
> priming would have to be application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:larsh@apache.org]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but
> want to return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
>
>
> ________________________________
> From: Vladimir Rodionov <vrodionov@carrieriq.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <larsh@apache.org>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that a column family is the locality group, and placing
> columns which are frequently accessed together into the same column
> family (locality group) is the obvious performance improvement tip.
> What are the "essential column families" for in this context?
>
> As for the original question: unless you place your column into a
> separate column family in Table 2, you will need to scan (load from
> disk if not cached) ~40x more data for the second table (because you
> have 40 columns). This may explain why you see such a difference in
> execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV;
> a row is just a set of KVs that happen to have the same row key (which
> is the first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase
> still needs to skip over many KVs in order to "reach" the next row. So
> more disk and memory IO is needed.
>
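
The row/column ordering described above can be pictured with a small worked count. The sketch below is a simplified model in plain Java, not HBase code: it treats a store file as a sorted map with one entry per (row, column) pair and counts how many entries a naive in-order walk touches to examine one column per row. The class name, key format, and the naive walk itself are assumptions; a real scanner can seek within a row, so the numbers are only an upper bound for this model.

```java
import java.util.TreeMap;

/** Simplified model of one KV per (row, column), sorted in row/column order; not HBase code. */
class KvLayoutSketch {

    /** Builds a sorted "store file" with one entry per (row, column) pair. */
    static TreeMap<String, String> build(int rows, int cols) {
        TreeMap<String, String> kvs = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                kvs.put(String.format("row%05d/col%02d", r, c), "v");
            }
        }
        return kvs;
    }

    /** Walks every KV in order; returns {entries touched, entries in the wanted column}. */
    static int[] walk(TreeMap<String, String> kvs, String qualifier) {
        int touched = 0;
        int matched = 0;
        for (String key : kvs.keySet()) {
            touched++;
            if (key.endsWith("/" + qualifier)) {
                matched++;
            }
        }
        return new int[] {touched, matched};
    }

    public static void main(String[] args) {
        int[] narrow = walk(build(6000, 1), "col00");
        int[] wide = walk(build(6000, 40), "col00");
        System.out.println("1 column:   touched " + narrow[0] + ", wanted " + narrow[1]);
        System.out.println("40 columns: touched " + wide[0] + ", wanted " + wide[1]);
    }
}
```

Both tables hold the same 6000 wanted entries, but in this model the 40-column table makes the walk pass over 240000 entries to reach them.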
> If you are using 0.94 there is a new feature, "essential column
> families". If you always search by the same column you can place that
> one in its own column family and all other columns in another column
> family. In that case your scan performance should be close to identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean <Tony.Dean@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.   And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> Scan scan = new Scan(startRow, stopRow);  // the start/stop rows should
>                                           // include all 6000 rows
> scan.addColumn(cf, qualifier);  // only return the column that I am
>                                 // interested in (should only be in 1 row
>                                 // and only 1 version)
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs) {
>     // process the single matching row
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.  Why is it taking so much
> longer on table 2 to get the same result?  The scan depth is the same.
> The only difference is the column width.  But I'm filtering on a single
> column and returning only that column.
>
> Am I missing something?  As I increase the number of columns, the
> response time gets worse.  I do expect the response time to get worse
> when increasing the number of rows, but not by increasing the number of
> columns since I'm returning only 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704          ...