hbase-user mailing list archives

From: Peter Wolf <opus...@gmail.com>
Subject: Re: Speeding up Scans
Date: Wed, 25 Jan 2012 19:58:16 GMT
Interesting,

I added this, and my scan did speed up somewhat:

         conf.setInt("hbase.client.prefetch.limit", 100);
         hTable = new HTable(conf, tableName);


What does this configuration property really control, and how should it be
set to an appropriate value?  What is a region, and how does it map to
rows, families and columns?  What are the tradeoffs of making it big?

Peter



On 1/25/12 2:32 PM, Doug Meil wrote:
> Thanks Geoff!  No apology required, that's good stuff.  I'll update the
> book with that param.
>
>
>
>
> On 1/25/12 2:17 PM, "Geoff Hendrey" <ghendrey@decarta.com> wrote:
>
>> Sorry for jumping in late, and perhaps out of context, but I'm pasting
>> in some findings (reported to this list by us a while back) that helped
>> us to get scans to perform very fast. Adjusting
>> hbase.client.prefetch.limit was critical for us:
>> ========================
>> It's even more mysterious than we think. There is a lack of documentation
>> (or perhaps a lack of know-how). Apparently there are two factors that
>> decide the performance of a scan.
>>
>> 1.	Scanner caching, as we know - we always had scanner caching set to
>> 1, but this is different from the prefetch limit.
>> 2.	hbase.client.prefetch.limit - this is the meta-caching limit. It
>> defaults to 10, so 10 region locations are prefetched every time a scan
>> hits a region that has not already been pre-warmed.
>>
>> The "hbase.client.prefetch.limit" value is passed along to the client
>> code to prefetch the next 10 region locations:
>>
>> int rows = Math.min(rowLimit,
>> configuration.getInt("hbase.meta.scanner.caching", 100));
>>
>> The "rows" variable mins out to 10, so at most 10 region locations are
>> prefetched at a time. Hence every region boundary that has not already
>> been pre-warmed triggers a fetch of the next 10 region locations,
>> resulting in a slow first query followed by quick responses. This is
>> basically pre-warming the meta cache, not the region cache.
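>>
>> To make this concrete, here is a minimal sketch of raising both knobs on
>> the client side (the table name is a placeholder, the numbers are only
>> examples, and the usual org.apache.hadoop.hbase.client imports are
>> assumed):
>>
>>          Configuration conf = HBaseConfiguration.create();
>>          // prefetch more region locations per META lookup (default is 10)
>>          conf.setInt("hbase.client.prefetch.limit", 50);
>>          HTable table = new HTable(conf, "mytable");
>>
>>          Scan scan = new Scan();
>>          // fetch rows in batches instead of one RPC per row (default is 1)
>>          scan.setCaching(500);
>>          ResultScanner scanner = table.getScanner(scan);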
>>
>> -----Original Message-----
>> From: Jeff Whiting [mailto:jeffw@qualtrics.com]
>> Sent: Wednesday, January 25, 2012 10:09 AM
>> To: user@hbase.apache.org
>> Subject: Re: Speeding up Scans
>>
>> Does it make sense to have better defaults so the performance out of the
>> box is better?
>>
>> ~Jeff
>>
>> On 1/25/2012 8:06 AM, Peter Wolf wrote:
>>> Ah ha!  I appear to be insane ;-)
>>>
>>> Adding the following sped things up quite a bit:
>>>
>>>          scan.setCacheBlocks(true);
>>>          scan.setCaching(1000);
>>>
>>> Thank you, it was a duh!
>>>
>>> P
>>>
>>>
>>>
>>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>>> Hi there-
>>>>
>>>> Quick sanity check:  what caching level are you using?  (default is 1)
>>>> I know this is basic, but it's always good to double-check.
>>>>
>>>> If "language" is already in the lead position of the rowkey, why use
>>>> the filter?
>>>>
>>>> As for EC2, that's a wildcard.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1/25/12 7:56 AM, "Peter Wolf" <opus111@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I am looking for advice on speeding up my Scanning.
>>>>>
>>>>> I want to iterate over all rows where a particular column (language)
>>>>> equals a particular value ("JA").
>>>>>
>>>>> I am already creating my row keys using that column in the first
>>>>> bytes.
>>>>> And I do my scans using partial row matching, like this...
>>>>>
>>>>>       public static byte[] calculateStartRowKey(String language) {
>>>>>           int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>           byte[] language2 = Bytes.toBytes(languageHash);
>>>>>           byte[] accountID2 = Bytes.toBytes(0);
>>>>>           byte[] timestamp2 = Bytes.toBytes(0);
>>>>>           return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>       }
>>>>>
>>>>>       public static byte[] calculateEndRowKey(String language) {
>>>>>           int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>>           byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>>           byte[] accountID2 = Bytes.toBytes(0);
>>>>>           byte[] timestamp2 = Bytes.toBytes(0);
>>>>>           return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>>       }
>>>>>
>>>>>       Scan scan = new Scan(calculateStartRowKey(language),
>>>>>               calculateEndRowKey(language));
>>>>>
>>>>>
>>>>> Since I am using a hash value for the string, I need to re-check the
>>>>> column to make sure that some other string does not get the same hash
>>>>> value:
>>>>>
>>>>>       Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>>               languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>>>       scan.setFilter(filter);
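>>>>>
>>>>> The scan is then run in the usual way, roughly like this (conf here is
>>>>> the usual HBaseConfiguration, and the table name is just a placeholder):
>>>>>
>>>>>       HTable hTable = new HTable(conf, "myTable");
>>>>>       ResultScanner scanner = hTable.getScanner(scan);
>>>>>       try {
>>>>>           for (Result result : scanner) {
>>>>>               // each result is a row whose key starts with the
>>>>>               // language hash and whose language column matches
>>>>>           }
>>>>>       } finally {
>>>>>           scanner.close();
>>>>>       }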
>>>>>
>>>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines
>>>>> on EC2.
>>>>>
>>>>> I think that this should be really fast, but it is not.  Any advice
>>>>> on how to debug/speed it up?
>>>>>
>>>>> Thanks
>>>>> Peter
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>> -- 
>> Jeff Whiting
>> Qualtrics Senior Software Engineer
>> jeffw@qualtrics.com
>>
>>
>

