From: Jonathan Gray
To: user@hbase.apache.org
Date: Wed, 19 May 2010 08:05:36 -0700
Subject: Re: Optimal block size for large columns

Currently every block requires another HDFS fetch. There are open
jiras about prefetching all required blocks, in which case there would
be no difference.

Your best bet is to test and benchmark with varied block and row
sizes. If you show big perf hits for multiple blocks, that would be a
good argument for getting prefetching implemented (at an already
largish size of 64k it's not clear how beneficial it will be). A
sketch of setting a per-family block size is appended after the
quoted thread below.

Please share your findings if you do any more experimentation.

On May 18, 2010, at 6:43 PM, "Jason Strutz" wrote:

> Thanks for your response Jonathan. We'll be doing largely single-
> row random lookups. In this scenario, would it be best to try to
> make the block size encompass a single row? How significant is the
> performance hit if hbase has to dig up multiple blocks to serve a
> single row?
>
>
> On May 18, 2010, at 3:12 PM, Jonathan Gray wrote:
>
>> It would depend on your read patterns.
>>
>> Is everything going to be single row gets, or will you also scan?
>>
>> Single row lookups will be faster with smaller block sizes, at the
>> expense of a larger index size (and potentially slower scans as you
>> have to deal with more block fetches).
>>
>>> -----Original Message-----
>>> From: Jason Strutz [mailto:jason@cumuluscode.com]
>>> Sent: Tuesday, May 18, 2010 9:33 AM
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Optimal block size for large columns
>>>
>>> I am working with a small cluster, trying to nail down appropriate
>>> settings for block size. We will have a single table with a single
>>> column of data averaging 300k in size, sometimes upwards of 2mb,
>>> never more than 10mb.
>>>
>>> Is there any rule-of-thumb or other sage advice for block sizes for
>>> large columns?
>>>
>>> Thanks!
>
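
For reference, a minimal sketch of setting the block size on a column
family through the Java client API (roughly the 0.20-era API current
at the time of writing; the table and family names are made up for
illustration, and the 300k value just mirrors the average cell size
mentioned in the original message):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithBlocksize {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    // Hypothetical table and family names, for illustration only.
    HTableDescriptor table = new HTableDescriptor("docs");
    HColumnDescriptor family = new HColumnDescriptor("data");

    // Default HFile block size is 64k. Raising it toward the average
    // cell size (~300k here) means a single-row get should touch
    // fewer blocks, at the cost of a coarser block index and more
    // data read per block.
    family.setBlocksize(300 * 1024);

    table.addFamily(family);
    admin.createTable(table);
  }
}

Benchmarking would then amount to repeating the create with a few
different setBlocksize() values and timing random single-row gets
against each table.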