hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: Speeding up Scans
Date Wed, 25 Jan 2012 19:32:59 GMT

Thanks Geoff!  No apology required, that's good stuff.  I'll update the
book with that param.




On 1/25/12 2:17 PM, "Geoff Hendrey" <ghendrey@decarta.com> wrote:

>Sorry for jumping in late, and perhaps out of context, but I'm pasting
>in some findings (reported to this list by us a while back) that helped
>us get scans to perform very fast. Adjusting
>hbase.client.prefetch.limit was critical for us:
>========================
>It's even more mysterious than we think. There is a lack of documentation
>(or perhaps a lack of know-how). Apparently there are two factors that
>decide the performance of a scan:
>
>1.	Scanner caching, as we know - we always had scanner caching set to
>1, but this is different from the prefetch limit
>2.	hbase.client.prefetch.limit - this is the META caching limit; it
>defaults to 10, so the client prefetches 10 region locations whenever the
>scan crosses a region boundary that has not already been pre-warmed
>
>The "hbase.client.prefetch.limit" value is passed along to the client code
>to prefetch the next 10 region locations.
>
>int rows = Math.min(rowLimit,
>    configuration.getInt("hbase.meta.scanner.caching", 100));
>
>The "rows" variable therefore resolves to 10 (the prefetch limit), so the
>client prefetches at most 10 region boundaries at a time. Hence every new
>region boundary that has not already been pre-warmed fetches the next 10
>region locations, resulting in one slow query followed by quick responses.
>This is basically pre-warming the META cache, not the region cache.
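>
>For reference, a minimal sketch of raising that limit on the client-side
>Configuration (the value 50 and the table name are only placeholders):
>
>import org.apache.hadoop.conf.Configuration;
>import org.apache.hadoop.hbase.HBaseConfiguration;
>import org.apache.hadoop.hbase.client.HTable;
>
>// Raise the META prefetch limit before opening the table;
>// the default is 10 region locations per prefetch.
>Configuration conf = HBaseConfiguration.create();
>conf.setInt("hbase.client.prefetch.limit", 50);  // example value, tune for your region count
>HTable table = new HTable(conf, "mytable");      // "mytable" is a placeholder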
>
>-----Original Message-----
>From: Jeff Whiting [mailto:jeffw@qualtrics.com]
>Sent: Wednesday, January 25, 2012 10:09 AM
>To: user@hbase.apache.org
>Subject: Re: Speeding up Scans
>
>Does it make sense to have better defaults so the performance out of the
>box is better?
>
>~Jeff
>
>On 1/25/2012 8:06 AM, Peter Wolf wrote:
>> Ah ha!  I appear to be insane ;-)
>>
>> Adding the following sped things up quite a bit:
>>
>>         scan.setCacheBlocks(true);
>>         scan.setCaching(1000);
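>>
>> As a minimal sketch, assuming "table" is an open HTable and the start/stop
>> keys are already computed, those settings fit into a scan loop like this:
>>
>>         Scan scan = new Scan(startRow, stopRow);
>>         scan.setCaching(1000);        // rows fetched per RPC to the region server
>>         scan.setCacheBlocks(true);    // let scanned blocks populate the block cache
>>         ResultScanner scanner = table.getScanner(scan);
>>         try {
>>             for (Result result : scanner) {
>>                 // process each row here
>>             }
>>         } finally {
>>             scanner.close();          // always release the scanner
>>         }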
>>
>> Thank you, it was a duh!
>>
>> P
>>
>>
>>
>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>> Hi there-
>>>
>>> Quick sanity check: what caching level are you using?  (default is 1)
>>> I know this is basic, but it's always good to double-check.
>>>
>>> If "language" is already in the lead position of the rowkey, why use
>>> the filter?
>>>
>>> As for EC2, that's a wildcard.
>>>
>>>
>>>
>>>
>>>
>>> On 1/25/12 7:56 AM, "Peter Wolf" <opus111@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am looking for advice on speeding up my Scanning.
>>>>
>>>> I want to iterate over all rows where a particular column (language)
>>>> equals a particular value ("JA").
>>>>
>>>> I am already creating my row keys using that column in the first
>>>> bytes.  And I do my scans using partial row matching, like this...
>>>>
>>>>      public static byte[] calculateStartRowKey(String language) {
>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>      }
>>>>
>>>>      public static byte[] calculateEndRowKey(String language) {
>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>      }
>>>>
>>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>>> calculateEndRowKey(language));
>>>>
>>>>
>>>> Since I am using a hash value for the string, I need to re-check the
>>>> column to make sure that some other string does not get the same
>>>> hash value.
>>>>
>>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>              languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>>      scan.setFilter(filter);
>>>>
>>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines
>>>> on EC2.
>>>>
>>>> I think that this should be really fast, but it is not.  Any advice
>>>> on how to debug/speed it up?
>>>>
>>>> Thanks
>>>> Peter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>-- 
>Jeff Whiting
>Qualtrics Senior Software Engineer
>jeffw@qualtrics.com
>
>


