hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-616) " We slept XXXXXX ms, ten times longer than scheduled: 3000" happens frequently.
Date Thu, 27 Nov 2008 02:10:44 GMT

     [ https://issues.apache.org/jira/browse/HBASE-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-616:
------------------------

    Fix Version/s: 0.19.0

jgray ran his uploader.  Oddly, everything seemed to work fine and the upload appeared to finish
cleanly, but then the GC log hung like this:

{code}
4702.183: [GC 723285K->644523K(903424K), 0.1020550 secs]
4741.964: [GC 751595K->674674K(903488K), 0.1030280 secs]
5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]
5685.743: [Full GC
{code}

It hung right after emitting 'Full GC'.

Eventually it cleared, after the GC had been hung for over 400 seconds:

{code}
4643.716: [GC 693488K->616512K(897216K), 0.0775520 secs]
4702.183: [GC 723285K->644523K(903424K), 0.1020550 secs]
4741.964: [GC 751595K->674674K(903488K), 0.1030280 secs]
5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]
5685.743: [Full GC 698672K->694148K(903744K), 434.7056360 secs]
6161.052: [Full GC 789962K->630359K(903744K), 23.9435320 secs]
6197.831: [Full GC 737687K->478895K(903744K), 28.7396540 secs]
6261.250: [GC 586222K->481646K(903744K), 0.0385590 secs]
6305.213: [GC 588974K->481846K(905472K), 0.1706230 secs]
6359.292: [GC 591478K->482681K(906048K), 0.0147330 secs]
6469.725: [GC 592313K->482763K(907072K), 0.0059020 secs]
...
{code}
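
One quick way to flag pauses like the 299s and 434s collections above is to pull the pause field out of each GC log line.  A small sketch, assuming a log file named gc.log (the file name and threshold are just for illustration, not part of any HBase tooling):

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: scan a GC log in the format shown above and print any collection
// whose pause exceeds a threshold.  File name and threshold are assumptions.
public class LongGcPauses {
  // Matches lines like "5386.336: [GC 781746K->698672K(903744K), 299.1547740 secs]"
  private static final Pattern PAUSE =
      Pattern.compile("^(\\d+\\.\\d+): \\[(GC|Full GC) .*, (\\d+\\.\\d+) secs\\]");

  public static void main(String[] args) throws Exception {
    double thresholdSecs = 10.0;
    try (BufferedReader in = new BufferedReader(new FileReader("gc.log"))) {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = PAUSE.matcher(line);
        if (m.find() && Double.parseDouble(m.group(3)) > thresholdSecs) {
          System.out.println("long pause at " + m.group(1) + "s: " + line);
        }
      }
    }
  }
}
{code}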

Looking in Ganglia, the whole cluster had gone into heavy swap.  Jon, looking at top, etc.,
saw that the machines had loads of free memory.  We could not explain the combination of
swapping, low CPU, and free memory.

Whatever the cause of the cluster swapping, it seems to be hanging GC, which in turn upsets HBase.
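
For context on why a hung GC upsets HBase: the warning in this issue's title fires when a periodic sleep wakes up far later than scheduled, which is exactly what a long stop-the-world pause (or a swap stall) looks like from inside the JVM, and while the process is stalled the region server cannot report to the master.  Below is a minimal sketch of that oversleep-detection pattern; it is illustrative only, not the actual org.apache.hadoop.hbase.util.Sleeper source, and the class name is made up.

{code}
// Illustrative sketch only -- not HBase code.  A periodic task sleeps for a
// fixed period and warns when the observed sleep runs far over schedule,
// which is how a stop-the-world GC pause or swap stall shows up in-process.
public class PauseDetectingSleeper {
  private final long periodMillis;

  public PauseDetectingSleeper(long periodMillis) {
    this.periodMillis = periodMillis;
  }

  public void sleep() {
    long start = System.currentTimeMillis();
    try {
      Thread.sleep(periodMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    long slept = System.currentTimeMillis() - start;
    if (slept > 10 * periodMillis) {
      System.err.println("We slept " + slept
          + "ms, ten times longer than scheduled: " + periodMillis);
    }
  }
}
{code}

If the stall goes on long enough, the region server gives up and aborts, as in the 'unable to report to master' FATAL quoted below.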

Looking in the logs, nothing is really obvious, though HDFS and deletes are suspect (pset has
reported deletes being problematic in the past, and so has streamy).  I see a bunch of deleted
commit files around that time, but looking at block deletes in the datanode, it seems to be
deleting bunches pretty constantly over time.  The datanodes at the time were also handling the
startup of a new MR job which was opening loads of files.

One thing to try is a rerun with deletion of commit logs disabled.

> " We slept XXXXXX ms, ten times longer than scheduled: 3000" happens frequently.
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-616
>                 URL: https://issues.apache.org/jira/browse/HBASE-616
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.19.0
>
>
> Just saw the below in a log... all in a row on the one server.
> {code}
>    4493 2008-05-05 18:08:17,512 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 34557ms, ten times longer than scheduled: 3000
>    4494 2008-05-05 18:11:08,879 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 30576ms, ten times longer than scheduled: 3000
>    4495 2008-05-05 18:30:45,056 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 1091720ms, ten times longer than scheduled: 3000
>    4496 2008-05-05 18:30:45,056 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 1094209ms, ten times longer than scheduled: 10000
>    4497 2008-05-05 18:30:45,429 FATAL org.apache.hadoop.hbase.HRegionServer: unable to report to master for 1092093 milliseconds - aborting server
> {code}
> We're seeing these kinds of outages pretty frequently.  In the case above, it was a small cluster that was using TableReduce to insert.  MR, HDFS, and HBase were all running on the same nodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

