hbase-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Re: scan performance improvement
Date Thu, 11 Nov 2010 13:08:56 GMT
Not that block size (that's the HDFS one), but the HBase block size. You set it at table creation
or it uses the default of 64K.
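
For reference, this is roughly how you could set it from the Java client when creating a table. Just a sketch, assuming a 0.90-style client API; the table and column family names are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithBlockSize {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor table = new HTableDescriptor("mytable");  // placeholder name
    HColumnDescriptor family = new HColumnDescriptor("cf");    // placeholder name
    // HBase block size in bytes (default is 64K); this is not dfs.block.size
    family.setBlocksize(256 * 1024);
    table.addFamily(family);
    admin.createTable(table);
  }
}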

The description of hbase.client.scanner.caching says:
Number of rows that will be fetched when calling next
on a scanner if it is not served from memory. Higher caching values
will enable faster scanners but will eat up more memory and some
calls of next may take longer and longer times when the cache is empty.

That means that it will pre-fetch that number of rows if the next row does not come from
memory. So if your rows are small enough to fit 100 of them in one block, it doesn't matter
whether you pre-fetch 1, 50 or 99, because it will only go to disk when it exhausts the whole
block, which stays in the block cache. So it will still fetch the same amount of data from disk
every time. If you increase the number to a value that is certain to load multiple blocks
at a time from disk, it will improve performance.
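
For example, with records of roughly 4K and the default 64K block size, about 16 rows fit in one block, so a caching value of 50 only spans three or four blocks per fetch; you would need a value well above that to load multiple blocks each round trip. A rough sketch of how you could set it per scan from the Java client (again assuming a 0.90-style API; the table name and caching value are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullScanWithCaching {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Alternatively, set the client-wide default in hbase-site.xml or here:
    // conf.setInt("hbase.client.scanner.caching", 1000);
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    Scan scan = new Scan();
    scan.setCaching(1000);  // rows fetched per round trip, for this scan only
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process row...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}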



On 11 Nov 2010, at 12:55, Oleg Ruchovets wrote:

> Yes, I thought about a large number, and as you said, it depends on block size.
> Good point.
> 
> I have one record of ~4K,
> the block size is:
> 
> <property>
>  <name>dfs.block.size</name>
>  <value>268435456</value>
>  <description>HDFS blocksize of 256MB for large file-systems.
> </description>
> </property>
> 
> What is the number that I should choose?
> I am afraid that using a number equal to one block leads to a
> SocketTimeoutException. Am I right?
> 
> Thanks Oleg.
> 
> 
> 
> 
> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
> fvanvollenhoven@xebia.com> wrote:
> 
>> How small is small? If it is bytes, then setting the value to 50 is not so
>> much different from 1, I suppose. If 50 rows fit in one block, it will just
>> fetch one block whether the setting is 1 or 50. You might want to try a
>> larger value. It should be fine if the records are small and you need them
>> all on the client side anyway.
>> 
>> It also depends on the block size, of course. When you only ever do full
>> scans on a table and little random access, you might want to increase that.
>> 
>> Friso
>> 
>> 
>> 
>> 
>> On 11 Nov 2010, at 12:15, Oleg Ruchovets wrote:
>> 
>>> Hi,
>>> To improve client performance I changed
>>> hbase.client.scanner.caching from 1 to 50.
>>> After running the client with the new value (hbase.client.scanner.caching =
>>> 50), it didn't improve execution time at all.
>>> 
>>> I have ~ 9 million small records.
>>> I have to do a full scan, so it brings all 9 million records to the client.
>>> My assumption was that this change would bring a significant improvement, but
>>> it did not.
>>> 
>>> Additional information:
>>> I scan a table which has 100 regions,
>>> 5 servers,
>>> 20 maps,
>>> 4 concurrent maps.
>>> The scan process takes 5.5 - 6 hours. To me that seems like too much time. Am I
>>> right? And how can I improve it?
>>> 
>>> 
>>> I changed the value in all hbase-site.xml files and restarted HBase.
>>> 
>>> Any suggestions?
>> 
>> 

