hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: region servers crashing
Date Thu, 26 Aug 2010 19:32:27 GMT
Without GC logs you cannot diagnose what you suspect are GC issues...
make sure you are logging, and then check the logs.  If you are running
a recent JVM you can use -XX:+PrintGCDateStamps to get better log
entries.
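
As a rough sketch of what that could look like (the flags are standard
HotSpot options, but the hbase-env.sh variable and log path here are just
illustrative, so adjust them for your install):

  # conf/hbase-env.sh -- illustrative snippet, assumed log path
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"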

Also, you cannot afford to swap at all; even one page of swapping in a
Java process can be a killer.  Combined with the hypervisor stealing your
CPU, you can end up with a lot of elapsed wall time but very few CPU
slices actually executed.  Use vmstat and top to diagnose that issue.
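
For example, with standard Linux tools (column names vary slightly across
distros, so treat this as a sketch):

  # si/so columns show pages swapped in/out; st is hypervisor steal time
  vmstat 5
  # quick check of how much swap is in use overall
  free -m
  # interactive view; watch the swap summary line and the %st figure
  top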

On the GC issue, the CMSInitiatingOccupancyFraction setting you are using
is quite low.  It means the CMS collector kicks in once you hit 50% heap
occupancy.  You might consider testing with it set to a more moderate
level, say 75% or so.
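
Concretely, that might look something like this (the extra
-XX:+UseCMSInitiatingOccupancyOnly flag is an optional addition, not
something you listed; without it the JVM may still use its own heuristics
to decide when to start CMS):

  # illustrative: trigger CMS at ~75% old-gen occupancy instead of 50%
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"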

-ryan

On Thu, Aug 26, 2010 at 12:17 PM, Dmitry Chechik <dmitry@tellapart.com> wrote:
> Hi all,
> We're still seeing these crashes pretty frequently. Attached is the error
> from the regionserver logs as well as a GC dump of the last hour of the
> regionserver:
> 2010-08-26 13:34:10,855 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 157041ms, ten times longer than scheduled: 10000
> 2010-08-26 13:34:10,925 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 148602ms, ten times longer than scheduled: 1000
> 2010-08-26 13:34:10,925 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to
> master for 148602 milliseconds - retrying
> Since our workload is mostly scans in mapreduce, we've turned off block
> caching as per https://issues.apache.org/jira/browse/HBASE-2252 in case that
> had anything to do with it.
> We've also decreased NewSize and MaxNewSize and decreased
> CMSInitiatingOccupancyFraction, so our GC settings now are:
> -Xmx2000m
>   -XX:+UseConcMarkSweepGC
>   -XX:CMSInitiatingOccupancyFraction=50
>   -XX:NewSize=32m
>   -XX:MaxNewSize=32m
>   -XX:+DoEscapeAnalysis
>   -XX:+AggressiveOpts
>   -verbose:gc
>   -XX:+PrintGCDetails
>   -XX:+PrintGCTimeStamp
> We're running with 2G of RAM.
> Is the solution here only to move to machines with more RAM, or are there
> other GC settings we should look at?
> Thanks,
> - Dmitry
> On Wed, Jul 14, 2010 at 4:39 PM, Dmitry Chechik <dmitry@tellapart.com>
> wrote:
>>
>> We're running with 1GB of heap space.
>> Thanks all - we'll look into GC tuning some more.
>>
>> On Wed, Jul 14, 2010 at 3:47 PM, Jonathan Gray <jgray@facebook.com> wrote:
>>>
>>> This doesn't look like a clock skew issue.
>>>
>>> @Dmitry, while you should be running CMS, this is still a garbage
>>> collector and is still vulnerable to GC pauses.  There are additional
>>> configuration parameters to tune even more.
>>>
>>> How much heap are you running with on your RSs?  If you are hitting your
>>> servers with lots of load you should run with 4GB or more.
>>>
>>> Also, having ZK on the same servers as RS/DN is going to create problems
>>> if you're already hitting your IO limits.
>>>
>>> JG
>>>
>>> > -----Original Message-----
>>> > From: Arun Ramakrishnan [mailto:aramakrishnan@languageweaver.com]
>>> > Sent: Wednesday, July 14, 2010 3:33 PM
>>> > To: user@hbase.apache.org
>>> > Subject: RE: region servers crashing
>>> >
>>> > Had a problem that caused issues that looked like this.
>>> >
>>> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>> > > 86246ms, ten times longer than scheduled: 1000
>>> >
>>> > Our problem was with clock skew. We just had to make sure ntp was
>>> > running on all machines and also the timezones detected on all the
>>> > machines were the same.
>>> >
>>> > -----Original Message-----
>>> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-
>>> > Daniel Cryans
>>> > Sent: Wednesday, July 14, 2010 3:11 PM
>>> > To: user@hbase.apache.org
>>> > Subject: Re: region servers crashing
>>> >
>>> > Dmitry,
>>> >
>>> > Your log shows this:
>>> >
>>> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>> > > 86246ms, ten times longer than scheduled: 1000
>>> >
>>> > This is a pause that lasted more than a minute, the process was in
>>> > that state (GC, swapping, mix of all of them) for some reason and it
>>> > was long enough to expire the ZooKeeper session (since from its point
>>> > of view the region server stopped responding).
>>> >
>>> > The NPE is just a side-effect, it is caused by the huge pause.
>>> >
>>> > It's well worth upgrading, but it won't solve your pausing issues. I
>>> > can only recommend closer monitoring, setting swappiness to 0 and
>>> > giving more memory to HBase (if available).
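>>> >
>>> > On a typical Linux box, turning swappiness down is usually along these
>>> > lines (a minimal sketch, assuming root access; the vm.swappiness sysctl
>>> > is standard):
>>> >
>>> >   # apply immediately
>>> >   sysctl -w vm.swappiness=0
>>> >   # persist across reboots
>>> >   echo "vm.swappiness = 0" >> /etc/sysctl.conf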
>>> >
>>> > J-D
>>> >
>>> > On Wed, Jul 14, 2010 at 3:03 PM, Dmitry Chechik <dmitry@tellapart.com>
>>> > wrote:
>>> > > Hi all,
>>> > > We've been having issues for a few days with HBase region servers
>>> > > crashing when under load from mapreduce jobs.
>>> > > There are a few different errors in the region server logs - I've
>>> > > attached a sample log of 4 different region servers crashing within
>>> > > an hour of each other.
>>> > > Some details:
>>> > > - This happens when a full table scan from a mapreduce is in progress.
>>> > > - We are running HBase 0.20.3, with a 16-slave cluster, on EC2.
>>> > > - Some of the region server errors are NPEs which look a lot like
>>> > > https://issues.apache.org/jira/browse/HBASE-2077. I'm not sure if that
>>> > > is the exact problem or if this issue is fixed in 0.20.5. Is it worth
>>> > > upgrading to 0.20.5 to fix this?
>>> > > - Some of the region server errors are scanner lease expired errors:
>>> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>> > > 86246ms, ten times longer than scheduled: 1000
>>> > > 2010-07-12 15:10:03,299 WARN org.apache.zookeeper.ClientCnxn: Exception
>>> > > closing session 0x229c72b89360001 to sun.nio.ch.SelectionKeyImpl@7f712b3a
>>> > > java.io.IOException: TIMED OUT
>>> > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>>> > > 2010-07-12 15:10:03,299 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
>>> > > Scanner 1779060682963568676 lease expired
>>> > > 2010-07-12 15:10:03,406 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
>>> > > org.apache.hadoop.hbase.UnknownScannerException: Name: 1779060682963568676
>>> > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1877)
>>> > >         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> > >         at java.lang.reflect.Method.invoke(Method.java:597)
>>> > >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>> > > We tried increasing hbase.regionserver.lease.period to 2 minutes but
>>> > > that didn't seem to make a difference here.
>>> > > - Our configuration and table size haven't changed significantly in
>>> > > those days.
>>> > > - We're running a 3-node Zookeeper cluster collocated on the same
>>> > > machines as the HBase/Hadoop cluster.
>>> > > - Based on Ganglia output, it doesn't look like the regionservers (or
>>> > > any of the machines) are swapping.
>>> > > - At the time of the crash, it doesn't appear that the network was
>>> > > overloaded (i.e. we've seen higher network traffic without crashes).
>>> > > So it doesn't seem that this is a problem communicating with Zookeeper.
>>> > > - We have "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" enabled,
>>> > > so it doesn't seem like we should be pausing due to GC too much.
>>> > > Any thoughts?
>>> > > Thanks,
>>> > > - Dmitry
>>
>
>
>
