hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: Speeding up Scans
Date Wed, 25 Jan 2012 19:17:37 GMT
Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we think. There is a lack of
documentation (or perhaps a lack of know-how). Apparently there are 2
factors that decide the performance of a scan.

1.	Scanner caching, as we know. We always had scanner caching set
to 1, but this is different from the prefetch limit.
2.	hbase.client.prefetch.limit. This is the meta caching limit. It
defaults to 10, so the client prefetches 10 region locations every time
a scan reaches a region that has not already been pre-warmed.
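
To illustrate why the first factor matters, here is a minimal sketch in
plain Java (no HBase dependency). It assumes each scanner round trip
returns at most `caching` rows and ignores region boundaries and scanner
open/close calls. The class and method names are hypothetical, and the
row count is made up for illustration.

```java
// Rough estimate only: assumes one RPC per batch of up to `caching` rows,
// ignoring region boundaries and scanner open/close calls.
final class ScanRoundTrips {

    // Number of next() round trips needed to stream totalRows rows.
    static long roundTrips(long totalRows, int caching) {
        if (totalRows < 0 || caching <= 0) {
            throw new IllegalArgumentException("totalRows >= 0 and caching > 0 required");
        }
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println("caching=1    -> " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=1000 -> " + roundTrips(rows, 1000) + " round trips");
    }
}
```

Under these assumptions, going from a caching value of 1 to 1000 cuts
the number of round trips by a factor of 1000.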

The "hbase.client.prefetch.limit" value is passed along to the client
code (as rowLimit below) to prefetch the next 10 region locations.

int rows = Math.min(rowLimit,
configuration.getInt("hbase.meta.scanner.caching", 100));

The "rows" variable evaluates to 10, so the client always prefetches at
most 10 region boundaries. Hence every new region boundary that has not
already been pre-warmed triggers a fetch of the next 10 region
locations, resulting in a slow first query followed by quick responses.
This basically pre-warms the meta cache, not the region cache.
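
The arithmetic above can be sketched in plain Java. This is an
illustration only, not the HBase client code: a Map stands in for the
HBase Configuration object, the class and helper names are hypothetical,
and the cold-lookup estimate assumes one meta lookup per batch of
prefetched region locations, as described above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map stands in for the HBase Configuration object.
final class PrefetchLimitSketch {

    // Hypothetical helper mirroring the getInt(key, default) lookup pattern.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Same min() as the quoted client line: the prefetch limit is capped
    // by hbase.meta.scanner.caching (default 100).
    static int regionsPerMetaLookup(Map<String, String> conf) {
        int rowLimit = getInt(conf, "hbase.client.prefetch.limit", 10);
        return Math.min(rowLimit, getInt(conf, "hbase.meta.scanner.caching", 100));
    }

    // Assumed model: one slow meta lookup per batch of prefetched regions.
    static int coldMetaLookups(int regionCount, int regionsPerLookup) {
        if (regionCount < 0 || regionsPerLookup <= 0) {
            throw new IllegalArgumentException("regionCount >= 0 and regionsPerLookup > 0 required");
        }
        return (regionCount + regionsPerLookup - 1) / regionsPerLookup;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(regionsPerMetaLookup(conf)); // defaults: min(10, 100)
        conf.put("hbase.client.prefetch.limit", "500");
        System.out.println(regionsPerMetaLookup(conf)); // capped by meta scanner caching
    }
}
```

With the defaults, a scan that crosses 25 cold regions would pay for
three meta lookups under this model. Raising hbase.client.prefetch.limit
reduces that, up to the cap set by hbase.meta.scanner.caching.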

-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com] 
Sent: Wednesday, January 25, 2012 10:09 AM
To: user@hbase.apache.org
Subject: Re: Speeding up Scans

Does it make sense to have better defaults so the performance out of the
box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha!  I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
>         scan.setCacheBlocks(true);
>         scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
>
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check:  what caching level are you using?  (default is
>> 1)  I know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use
>> the filter?
>>
>> As for EC2, that's a wildcard.
>>
>>
>>
>>
>>
>> On 1/25/12 7:56 AM, "Peter Wolf"<opus111@gmail.com>  wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first
>>> bytes. And I do my scans using partial row matching, like this...
>>>
>>>      public static byte[] calculateStartRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      public static byte[] calculateEndRowKey(String language) {
>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>      }
>>>
>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>> calculateEndRowKey(language));
>>>
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same
>>> hash value
>>>
>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>          languageCol, CompareFilter.CompareOp.EQUAL,
>>>          Bytes.toBytes(language));
>>>      scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines
>>> on EC2.
>>>
>>> I think that this should be really fast, but it is not.  Any advice
>>> on how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
>>>
>>>
>>>
>>>
>>>
>>
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

