Subject: RE: Speeding up Scans
Date: Wed, 25 Jan 2012 11:17:37 -0800
From: "Geoff Hendrey" <ghendrey@decarta.com>
To: <user@hbase.apache.org>

Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we think. There is a lack of documentation
(or perhaps a lack of know-how). Apparently there are 2 factors that
determine scan performance.

1.	Scanner caching, as we know. We always had scanner caching set to
1, but this is different from the prefetch limit.
2.	hbase.client.prefetch.limit. This is the meta caching limit. It
defaults to 10, so each time a scan reaches a region location that has
not already been pre-warmed, 10 region locations are prefetched.
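
To make the first factor concrete, here is a rough back-of-the-envelope
sketch in plain Java (no HBase dependency). It assumes the simplest possible
model: each scanner next() call returns up to "caching" rows, and the calls
that open and close the scanner are ignored. The class name, method name, and
the row count of 1,000,000 are made up for illustration; the caching value of
1000 is the one used later in this thread.

```java
public class ScanRoundTrips {

    /**
     * Approximate number of next() round trips needed to return rowCount
     * rows when each call brings back at most `caching` rows.
     * Simplified model: scanner open/close calls are not counted.
     */
    static long roundTrips(long rowCount, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        // Ceiling division without floating point.
        return (rowCount + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println("caching=1    -> " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=1000 -> " + roundTrips(rows, 1000) + " round trips");
    }
}
```

Under this model the number of round trips falls in proportion to the
caching value, which is consistent with the speedup reported in this thread
after calling scan.setCaching(1000).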

The "hbase.client.prefetch.limit" value is passed along to the client code
to prefetch the next 10 region locations.

int rows = Math.min(rowLimit,
    configuration.getInt("hbase.meta.scanner.caching", 100));

With the defaults, the "rows" variable works out to min(10, 100) = 10, so
the client always prefetches at most 10 region boundaries. Hence every new
region boundary that has not already been pre-warmed triggers a fetch of
the next 10 region locations, resulting in a slow first query followed by
quick responses. This is basically pre-warming the meta cache, not the
region cache.
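
The effect of the prefetch limit can be sketched the same way. The code
below is NOT the HBase client code; it is a simplified model, in plain Java,
of the Math.min line quoted above, under the assumption that a scan over a
cold cache triggers one meta lookup per batch of prefetched region
locations. The class and method names, and the figure of 95 regions, are
hypothetical.

```java
public class MetaPrefetchModel {

    /** Mirrors the quoted line: rows = min(prefetch limit, meta scanner caching). */
    static int rowsPerPrefetch(int prefetchLimit, int metaScannerCaching) {
        return Math.min(prefetchLimit, metaScannerCaching);
    }

    /**
     * Rough count of meta lookups a scan makes when crossing regionCount
     * regions with nothing pre-warmed (assumed model, not measured).
     */
    static int coldLookups(int regionCount, int prefetchLimit, int metaScannerCaching) {
        int rows = rowsPerPrefetch(prefetchLimit, metaScannerCaching);
        if (rows < 1) {
            throw new IllegalArgumentException("prefetch size must be >= 1");
        }
        return (regionCount + rows - 1) / rows; // ceiling division
    }

    public static void main(String[] args) {
        int regions = 95; // hypothetical number of regions touched by the scan
        System.out.println("limit=10  -> " + coldLookups(regions, 10, 100) + " meta lookups");
        System.out.println("limit=100 -> " + coldLookups(regions, 100, 100) + " meta lookups");
    }
}
```

Note that in this model raising hbase.client.prefetch.limit above the
hbase.meta.scanner.caching value has no further effect, because the smaller
of the two wins.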

-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com]
Sent: Wednesday, January 25, 2012 10:09 AM
To: user@hbase.apache.org
Subject: Re: Speeding up Scans

Does it make sense to have better defaults so the performance out of the
box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha!  I appear to be insane ;-)
>
> Adding the following sped things up quite a bit
>
>         scan.setCacheBlocks(true);
>         scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
>
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check:  what caching level are you using?  (default is 1)  I
>> know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use the
>> filter?
>>
>> As for EC2, that's a wildcard.
>>
>>
>>
>>
>>
>> On 1/25/12 7:56 AM, "Peter Wolf"<opus111@gmail.com>  wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first bytes.
>>> And I do my scans using partial row matching, like this...
>>>
>>>      public static byte[] calculateStartRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      public static byte[] calculateEndRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>>          calculateEndRowKey(language));
>>>
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same hash
>>> value
>>>
>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>          languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>      scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>> EC2.
>>>
>>> I think that this should be really fast, but it is not.  Any advice on
>>> how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
>>>
>>>
>>>
>>>
>>>
>>
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com