hbase-user mailing list archives

From Mark Vigeant <mark.vige...@riskmetrics.com>
Subject RE: Smaller Region Size?
Date Wed, 23 Dec 2009 17:09:04 GMT
> The biggest legitimate reason to run a smaller region size is if your
> data set is small (let's say 400 MB) but highly accessed, so you want a
> good spread of regions across your cluster.

That's exactly it: my input dataset was 500 MB total (~1,000,000 rows), and it was being stored
as just one region on one regionserver.
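
For illustration, here is a back-of-envelope sketch of how a region size cap relates to region count for a dataset like this one. It assumes data is spread evenly over the key space and that a region holds up to the cap before splitting; since a split yields two roughly half-sized regions, the real count may be higher, so treat the result as a lower bound. The function names and the target of 4 regions are hypothetical, not taken from the thread.

```python
import math

MB = 1024 * 1024


def estimated_regions(dataset_bytes: int, max_filesize_bytes: int) -> int:
    """Rough lower bound on region count, assuming each region fills to the cap."""
    if max_filesize_bytes <= 0:
        raise ValueError("max_filesize_bytes must be positive")
    return max(1, math.ceil(dataset_bytes / max_filesize_bytes))


def max_filesize_for_spread(dataset_bytes: int, target_regions: int) -> int:
    """Largest cap (bytes) that would still give at least target_regions regions."""
    if target_regions < 1:
        raise ValueError("target_regions must be at least 1")
    return dataset_bytes // target_regions


if __name__ == "__main__":
    dataset = 500 * MB   # dataset size mentioned in the message
    target = 4           # hypothetical number of regions wanted
    cap = max_filesize_for_spread(dataset, target)
    print(f"cap of about {cap / MB:.0f} MB gives at least "
          f"{estimated_regions(dataset, cap)} regions")
```

Under these assumptions, a cap larger than the whole dataset leaves everything in a single region, which matches the behaviour described above.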

In response to St.Ack: I don't think my regions are performing too many splits. The regionserver
logs mainly consist of the occasional ZooKeeper connection error and these two entries, repeated:

2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), Max=799.175MB (837995712),
Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%,
Miss Ratio=100.0%, Evicted/Run=NaN

2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store: Skipping major compaction
of Message because one (major) compacted file only and elapsedTime 339624149ms is < ttl=9223372036854775807
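
As a reading aid for the first of the two log entries above, the sketch below pulls the counters out of the "Cache Stats" text and recomputes the hit ratio. The parsing is an assumption based only on the format of the line quoted here, not on any HBase API; the helper names are hypothetical.

```python
import re

# The Cache Stats entry quoted above, joined onto one line.
LINE = (
    "Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), "
    "Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, "
    "Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, Evicted/Run=NaN"
)


def parse_counts(line: str) -> dict:
    """Extract the integer counters between 'Counts:' and 'Ratios:'."""
    counts_part = line.split("Counts:", 1)[1].split("Ratios:", 1)[0]
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", counts_part)}


def hit_ratio(counts: dict) -> float:
    """Hits divided by accesses; NaN when there were no accesses at all."""
    access = counts["Access"]
    return counts["Hit"] / access if access else float("nan")


if __name__ == "__main__":
    counts = parse_counts(LINE)
    print(counts)
    print(f"hit ratio: {hit_ratio(counts):.1%}")
```

Read this way, the entry reports 25,755 block accesses, every one of them a miss, with no blocks held in the cache. The `Evicted/Run=NaN` value is presumably a zero-by-zero division, since no eviction runs have happened.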

Are you suggesting that performance would improve if the dataset were larger? What other
parameters can be tuned based on data size?

Thanks
-Mark
-----Original Message-----
From: Ryan Rawson [mailto:ryanobjc@gmail.com]
Sent: Tuesday, December 22, 2009 11:28 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Smaller Region Size?

The biggest legitimate reason to run a smaller region size is if your
data set is small (let's say 400 MB) but highly accessed, so you want a
good spread of regions across your cluster.

Another is to run a larger region size if you have a huge table and
you want to keep the absolute region count low. I am not 100% sold on this
yet.

I have a patch that keeps performance high on a heavily split
table by using parallel puts. This has been proven to keep aggregate
performance really high, and I hope it will make it into 0.20.3.
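
To make the general idea of parallel puts concrete, here is a minimal sketch. It is not the patch mentioned above and does not use any HBase client API: `group_by_region`, `parallel_put`, and the `send_batch` callback are all hypothetical names. The assumption is that the client knows the sorted start keys of the table's regions (the first being the empty string), groups rows by owning region, and sends each group concurrently.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_region(row_keys, region_start_keys):
    """Group row keys by the start key of the region that would hold them.

    region_start_keys must be sorted and begin with "" (start of the table).
    """
    batches = defaultdict(list)
    for key in row_keys:
        idx = bisect_right(region_start_keys, key) - 1
        batches[region_start_keys[idx]].append(key)
    return dict(batches)


def parallel_put(row_keys, region_start_keys, send_batch, max_workers=4):
    """Send one batch per region concurrently; send_batch is a caller-supplied stub."""
    batches = group_by_region(row_keys, region_start_keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            start: pool.submit(send_batch, start, rows)
            for start, rows in batches.items()
        }
        # Collect results; an exception raised in any batch propagates here.
        return {start: future.result() for start, future in futures.items()}


if __name__ == "__main__":
    starts = ["", "g", "n"]  # hypothetical region boundaries
    result = parallel_put(["a", "m", "z", "b"], starts, lambda start, rows: len(rows))
    print(result)
```

The point of the design is that a table split into many regions no longer forces the client to write to one region at a time; whether the actual patch works this way is not stated in the thread.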

On Tue, Dec 22, 2009 at 2:31 PM, stack <stack@duboce.net> wrote:
> On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
> <mark.vigeant@riskmetrics.com> wrote:
>
>> J-D,
>>
>> I noticed that performance for uploading data into tables got a lot better
>> as I lowered the max file size, but only up to a certain point, after which
>> performance began slowing down again.
>>
>>
> Tell us more.  What kinda size changes did you make?  How many regions were
> created?  Is the slowdown because the table is splitting all the time?  If you
> study the regionserver logs, can you make out what the regionservers are
> spending their time doing?
>
>
>
>> Is there a rule of thumb/formula/notion to rely on when setting this
>> parameter for optimal performance? Thanks!
>>
>>
> We have the most experience running defaults.  Generally folks go up from the
> default size because they want to host more data in about the same number of
> regions.  Going down from the default I've not seen much of.
>
> St.Ack
>

