From: Jonathan Gray
To: user@hbase.apache.org
Date: Wed, 19 May 2010 08:05:36 -0700
Subject: Re: Optimal block size for large columns

Currently every block requires another HDFS fetch. There are open
jiras about prefetching all required blocks, in which case there would
be no difference.

Your best bet is to test and benchmark with varied block and row
sizes. If you show big perf hits for multiple blocks, that would be a
good argument for getting prefetching implemented (at an already
largish size of 64k it's not clear how beneficial it will be). A
sketch of setting a per-family block size is appended after the
quoted thread below.

Please share your findings if you do any more experimentation.

On May 18, 2010, at 6:43 PM, "Jason Strutz" wrote:

> Thanks for your response Jonathan. We'll be doing largely single-
> row random lookups. In this scenario, would it be best to try to
> make the block size encompass a single row? How significant is the
> performance hit if hbase has to dig up multiple blocks to serve a
> single row?
>
>
> On May 18, 2010, at 3:12 PM, Jonathan Gray wrote:
>
>> It would depend on your read patterns.
>>
>> Is everything going to be single row gets, or will you also scan?
>>
>> Single row lookups will be faster with smaller block sizes, at the
>> expense of a larger index size (and potentially slower scans as you
>> have to deal with more block fetches).
>>
>>> -----Original Message-----
>>> From: Jason Strutz [mailto:jason@cumuluscode.com]
>>> Sent: Tuesday, May 18, 2010 9:33 AM
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Optimal block size for large columns
>>>
>>> I am working with a small cluster, trying to nail down appropriate
>>> settings for block size. We will have a single table with a single
>>> column of data averaging 300k in size, sometimes upwards of 2mb,
>>> never more than 10mb.
>>>
>>> Is there any rule-of-thumb or other sage advice for block sizes for
>>> large columns?
>>>
>>> Thanks!
>
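
For reference, a minimal sketch of setting the block size on a column
family through the Java client API (roughly the 0.20-era API current
at the time of writing; the table and family names are made up for
illustration, and the 300k value just mirrors the average cell size
mentioned in the original message):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithBlocksize {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    // Hypothetical table and family names, for illustration only.
    HTableDescriptor table = new HTableDescriptor("docs");
    HColumnDescriptor family = new HColumnDescriptor("data");

    // Default HFile block size is 64k. Raising it toward the average
    // cell size (~300k here) means a single-row get should touch
    // fewer blocks, at the cost of a coarser block index and more
    // data read per block.
    family.setBlocksize(300 * 1024);

    table.addFamily(family);
    admin.createTable(table);
  }
}

Benchmarking would then amount to repeating the create with a few
different setBlocksize() values and timing random single-row gets
against each table.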