hbase-dev mailing list archives

From "Pan, Thomas" <th...@ebay.com>
Subject Re: Scan performance on a big table as combination of multiple logic tables
Date Sat, 18 Feb 2012 07:25:43 GMT

Jacques, thanks for the details on region size. We've observed that the
number of regions per region server can skew badly at the table level. We
do have a tool to balance regions. Still, it is rather annoying to maintain
the balance. $0.02, -Thomas
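
To illustrate the point about table-level skew, here is a minimal sketch in plain Python. It does not use any HBase API; the table names, server names, and the `assignments` list are hypothetical. It only shows how a cluster can look balanced by total region count per server while each individual table is badly skewed.

```python
from collections import Counter, defaultdict


def regions_per_server(assignments):
    """Group (table, server) pairs into {table: Counter({server: region_count})}."""
    per_table = defaultdict(Counter)
    for table, server in assignments:
        per_table[table][server] += 1
    return per_table


def table_skew(per_table, servers):
    """Return {table: max - min regions per server}; servers with no regions count as 0."""
    skew = {}
    for table, counts in per_table.items():
        per_server = [counts.get(s, 0) for s in servers]
        skew[table] = max(per_server) - min(per_server)
    return skew


# Hypothetical layout: every region of "orders" sits on rs1, every region of "users" on rs2.
servers = ["rs1", "rs2"]
assignments = [("orders", "rs1")] * 4 + [("users", "rs2")] * 4

per_table = regions_per_server(assignments)
totals = Counter(server for _, server in assignments)
skew = table_skew(per_table, servers)

print(dict(totals))  # total regions per server
print(skew)          # per-table spread between the busiest and idlest server
```

In this made-up layout both servers hold the same number of regions overall, yet a scan of either table would hit only one server.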

On 2/17/12 2:46 PM, "Jacques" <whshub@gmail.com> wrote:

>You should be fine having multiple tables with high region counts.  I
>avoid making thousands of tables.  However, if you have three separate
>business needs, make three different tables.
>You seem to be starting with a perspective that there would be some kind
>of issues with multiple tables.  Why do you think this exists?  You said
>"Otherwise, runtime tuning seems to add quite amount of operational cost."
>I'm not sure what you are thinking here and where your thoughts are coming
>from.  Additionally, if you have separate tables, then you can modify them
>differently (e.g. setting them to different region sizes if it makes
>sense-- for example, some of our tables have smaller region sizes so we'll
>have more maps rather than fewer when we run map reduce jobs).
>Regarding region size: the HFile v1 format in 0.90 and below suffered from
>taking a long time to transition as individual regions got too big.  With
>0.92 and HFile v2 that isn't as much of a problem as I understand it.  If I
>recall correctly, there are numerous organizations using 10 GB regions with
>success (among others, I believe this is what Yahoo reported they were using
>for their web crawl tables on their thousand-node cluster).  While I
>haven't run any stats, I believe that there is negligible scan performance
>impact as region size grows.  There is definitely no exponential negative
>performance impact.
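
The remark about smaller region sizes giving more maps can be made concrete with some rough arithmetic. The sketch below is plain Python, not an HBase API. It assumes one map task per region (the usual default for table-scanning MapReduce jobs) and assumes every region is filled to its maximum size, so the result is only a rough lower bound on the region count. The 500 GB table size is a hypothetical figure.

```python
import math

GB = 1024 ** 3


def region_count(table_bytes, max_region_bytes):
    """Rough lower bound on regions: table size divided by max region size, rounded up."""
    return max(1, math.ceil(table_bytes / max_region_bytes))


table_size = 500 * GB  # hypothetical table size

for region_gb in (1, 4, 10):
    n = region_count(table_size, region_gb * GB)
    # Assumption: one map task per region when scanning the table from MapReduce.
    print(f"{region_gb} GB regions -> about {n} regions and about {n} map tasks")
```

Under these assumptions, shrinking the region size by a factor of ten gives roughly ten times as many map tasks for the same table, at the price of ten times as many regions to host.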
>On Fri, Feb 17, 2012 at 10:55 AM, Pan, Thomas <thpan@ebay.com> wrote:
>> Vladimir and Jacques, thanks for the information! Unless HBase handles
>> multiple big tables (each with a relatively high region count) well in one
>> cluster, it seems to me that one big table is the way to go. Otherwise,
>> runtime tuning seems to add quite amount of operational cost. That leads
>> to another question. Do we see big region size as an issue? If so, where is
>> the pivot point past which, as region size grows further, scan performance
>> starts to degrade exponentially?
>> On 2/15/12 4:11 PM, "Vladimir Rodionov" <vrodionov@carrieriq.com> wrote:
>> >10 tables are fine. 1000 are not, especially when one does table
>> >pre-splitting to increase write perf.
>> >
>> >Too many regions kill HBase.
>> >
>> >Best regards,
>> >Vladimir Rodionov
>> >Principal Platform Engineer
>> >Carrier IQ, www.carrieriq.com
>> >e-mail: vrodionov@carrieriq.com
>> >
>> >________________________________________
>> >From: Jacques [whshub@gmail.com]
>> >Sent: Wednesday, February 15, 2012 3:45 PM
>> >To: dev@hbase.apache.org
>> >Subject: Re: Scan performance on a big table as combination of multiple
>> >logic tables
>> >
>> >Out of curiosity, what do you perceive as the benefit to having only one
>> >table?  Are there reasons that you think one table would perform better
>> >than a few?
>> >
>> >If you're splitting data within a table because you'd otherwise have
>> >millions of tables, I understand that and would concur with Vladimir's
>> >approach below.  However, if you're really looking at 10 tables versus one
>> >table, it seems like HBase is built exactly to make that work well (rather
>> >than having to make all sorts of application level code to do what it
>> >already does).
>> >
>> >thanks,
>> >Jacques
>> >
>> >On Wed, Feb 15, 2012 at 1:57 PM, Pan, Thomas <thpan@ebay.com> wrote:
>> >
>> >>
>> >> Since HBase is tailored to handle one table very well, we are to put
>> >> multiple tables into one big table but on different column family sets.
>> >> Our use case is full table scan against single column value filters.
>> >> Since records from different "logical tables" are in different column
>> >> families, could we speed up the scan performance by simply checking the
>> >> column family referenced by these single column value filters first,
>> >> before really going through all the underlying K-V pairs? It would be
>> >> great if the HBase code is already coded that way.
>> >>
>> >>
>> >> $0.02,
>> >> Thomas
>> >>
>> >>
>> >
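
The original question in this thread asks whether a scan with single column value filters can skip column families the filters never reference. HBase keeps each column family in its own store files, and the client API lets a scan be restricted to named families (`Scan.addFamily`), so the cost difference can be sketched with a toy model. The code below is plain Python, not the HBase client. The `store` layout, family names, and the `scan` helper are all hypothetical and only count how many cells each approach would touch.

```python
def scan(store, families, column, expected):
    """Return (matching_row_keys, cells_examined) for an equality filter on one column.

    store:    {family: {row_key: {qualifier: value}}}, one dict per column family,
              mimicking the fact that each family is stored separately.
    families: the families this scan is allowed to read.
    column:   (family, qualifier) that the filter tests.
    """
    filter_family, qualifier = column
    examined = 0
    matches = []
    for family in families:
        for row_key, cells in store.get(family, {}).items():
            examined += len(cells)  # every cell in a scanned family is read
            if family == filter_family and cells.get(qualifier) == expected:
                matches.append(row_key)
    return sorted(matches), examined


# Two hypothetical "logical tables" stored as families "a" and "b" of one big table.
store = {
    "a": {"r1": {"status": "open", "owner": "x"},
          "r2": {"status": "closed", "owner": "y"}},
    "b": {"r3": {"kind": "k1"}, "r4": {"kind": "k2"}, "r5": {"kind": "k1"}},
}

all_rows, all_cost = scan(store, ["a", "b"], ("a", "status"), "open")
scoped_rows, scoped_cost = scan(store, ["a"], ("a", "status"), "open")

print(all_rows, all_cost)        # unrestricted scan reads both families
print(scoped_rows, scoped_cost)  # family-restricted scan reads only family "a"
```

Both scans return the same rows, but the restricted one touches only the cells of the family the filter refers to. Whether a real HBase scan gets the same saving depends on the scan actually being limited to that family; this sketch does not model how the server evaluates filters.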
