hbase-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: Cluster Wide Pauses
Date Wed, 12 Jan 2011 17:03:00 GMT
Added: https://issues.apache.org/jira/browse/HBASE-3438.

On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav100@gmail.com> wrote:

> We are using 0.89.20100924, r1001068
>
> We are seeing it during heavy write load (which is all the time), but
> yesterday we had read load as well as write load and saw both reads and
> writes stop for 10+ seconds. The region size is the biggest clue we have
> found from our tests: when setting up a new cluster with a 1GB max region
> size and starting to load heavily, we see this a lot and for long time
> frames. Maybe the bigger file gets hung up more easily with a split? Your
> description below also fits, in that early on the load is not well
> balanced, so it is easier to stop everything on one node. I will file a
> JIRA. I will also try to dig deeper into the logs during the pauses to
> find a node that might be stuck in a split.
>
>
>
> On Wed, Jan 12, 2011 at 11:17 AM, Stack <stack@duboce.net> wrote:
>
>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav100@gmail.com> wrote:
>> > We have very frequent cluster wide pauses that stop all reads and writes
>> > for seconds.
>>
>> All reads and all writes?
>>
>> I've seen the pause too for writes.  It's something I've always meant
>> to look into.  Friso postulates one cause.  Another that we've talked
>> of is a region taking a while to come back on line after a split or a
>> rebalance for whatever reason.  Client loading might be 'random'
>> spraying over lots of random regions but they all get stuck waiting on
>> one particular region to come back online.
>>
>> I suppose reads could be blocked for same reason if all are trying to
>> read from the offlined region.
>>
>> What version of hbase are you using?  Splits should be faster in 0.90
>> now that the split daughters come up on the same regionserver.
>>
>> Sorry I don't have a better answer for you.  Need to dig in.
>>
>> File a JIRA.  If you want to help out some, stick some data up in it.
>> Some suggestions would be to enable logging of when we lookup region
>> locations in client and then note when requests go to zero.  Can you
>> figure out what region the clients are waiting on (if they are waiting
>> on any)?  If you can pull out a particular one, try and elicit its
>> history at time of blockage.  Is it being moved or mid-split?  I
>> suppose it makes sense that bigger regions would make the situation
>> 'worse'.  I can take a look at it too.
>>
>> St.Ack
>>
>>
>>
>>
>> > We are constantly loading data to this cluster of 10 nodes.
>> > These pauses can happen as frequently as every minute but sometimes are
>> > not seen for 15+ minutes. Basically watching the Region server list with
>> > request counts is the only evidence of what is going on. All reads and
>> > writes totally stop and if there is ever any activity it is on the node
>> > hosting the .META. table with a request count of region count + 1. This
>> > problem seems to be worse with a larger region size. We tried a 1GB
>> > region size and saw this more than we saw actual activity (and stopped
>> > using a larger region size because of it). We went back to the default
>> > region size and it was better, but we had too many regions so now we are
>> > up to 512M for a region size and we are seeing it more again.
>> >
>> > Does anyone know what this is? We have dug into all of the logs to find
>> > some sort of pause but are not able to find anything. Is this a WAL
>> > (HLog) roll? Is this a region split or compaction? Of course our biggest
>> > fear is a GC pause on the master but we do not have Java logging turned
>> > on for the master to tell. What could possibly stop the entire cluster
>> > from working for seconds at a time, very frequently?
>> >
>> > Thanks in advance for any ideas of what could be causing this.
>> >
>>
>
>
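
Following Stack's suggestion above to note when requests go to zero, here is
a minimal Python sketch of how the pause windows could be pulled out of
sampled request counts, so they can be lined up against the region server
logs. It does not talk to HBase at all. It assumes you already collect
(timestamp, total request count) pairs yourself, for example by polling the
Region server list. The function name find_pauses and the parameters
min_seconds and idle_at_or_below are made up for this example.

```python
from typing import Iterable, List, Tuple


def find_pauses(
    samples: Iterable[Tuple[float, int]],
    min_seconds: float = 5.0,
    idle_at_or_below: int = 0,
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of idle windows in the samples.

    samples: (timestamp_in_seconds, cluster_wide_request_count) pairs,
             sorted by timestamp. How they are collected is up to you
             (hypothetical input, not an HBase API).
    min_seconds: ignore idle windows shorter than this.
    idle_at_or_below: counts at or below this value are treated as idle.
    """
    pauses: List[Tuple[float, float]] = []
    start = None      # timestamp of the first idle sample in the current window
    last_idle = None  # timestamp of the most recent idle sample

    for ts, requests in samples:
        if requests <= idle_at_or_below:
            if start is None:
                start = ts
            last_idle = ts
        else:
            # Activity resumed: close the current window if it was long enough.
            if start is not None and last_idle - start >= min_seconds:
                pauses.append((start, last_idle))
            start = None

    # Handle a window that is still open at the end of the samples.
    if start is not None and last_idle - start >= min_seconds:
        pauses.append((start, last_idle))
    return pauses


if __name__ == "__main__":
    # Made-up sample data: one sample every 5 seconds.
    demo = [(0, 1200), (5, 0), (10, 0), (15, 0), (20, 900), (25, 0), (30, 1000)]
    print(find_pauses(demo, min_seconds=10))
```

Since the node hosting .META. may still show a small request count during a
pause, idle_at_or_below can be raised above zero so that those samples still
count as idle.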
