Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-dev@hadoop.apache.org
Message-ID: <118390927.1228183424286.JavaMail.jira@brutus>
Date: Mon, 1 Dec 2008 18:03:44 -0800 (PST)
From: "Andrew Purtell (JIRA)" <jira@apache.org>
To: hbase-dev@hadoop.apache.org
Subject: [jira] Updated: (HBASE-1040) OOME does not cause graceful shutdown
 under some failure scenarios
In-Reply-To: <1798467071.1228183307437.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1040:
----------------------------------

    Description: 
The OOME-related updates on trunk should probably be backported to the 0.18 branch. I am seeing these exceptions on our cluster in the output of tablemap/tablereduce jobs:

> java.io.IOException: java.lang.OutOfMemoryError: Java heap space
> at java.io.DataInputStream.readFully(DataInputStream.java:175)
> at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
> at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
> at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:516)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.getNext(StoreFileScanner.java:312)

When OOMEs like the above happen, the cluster does not recover without manual intervention. The regionservers sometimes go down afterwards; other times they stay up in an unhealthy state for a while. Regions go offline and remain unavailable.


  was:
The OOME-related updates on trunk should probably be backported to the 0.18 branch. I am seeing these exceptions on our cluster:

> java.io.IOException: java.lang.OutOfMemoryError: Java heap space
> at java.io.DataInputStream.readFully(DataInputStream.java:175)
> at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
> at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
> at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:516)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.getNext(StoreFileScanner.java:312)

When OOMEs like the above happen, the cluster does not recover without manual intervention. The regionservers sometimes go down afterwards; other times they stay up in an unhealthy state for a while. Regions go offline and remain unavailable.


> OOME does not cause graceful shutdown under some failure scenarios
> ------------------------------------------------------------------
>
>                 Key: HBASE-1040
>                 URL: https://issues.apache.org/jira/browse/HBASE-1040
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.18.1
>            Reporter: Andrew Purtell
>
> The OOME-related updates on trunk should probably be backported to the 0.18 branch. I am seeing these exceptions on our cluster in the output of tablemap/tablereduce jobs:
> > java.io.IOException: java.lang.OutOfMemoryError: Java heap space
> > at java.io.DataInputStream.readFully(DataInputStream.java:175)
> > at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
> > at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
> > at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
> > at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
> > at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
> > at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:516)
> > at org.apache.hadoop.hbase.regionserver.StoreFileScanner.getNext(StoreFileScanner.java:312)
> When OOMEs like the above happen, the cluster does not recover without manual intervention. The regionservers sometimes go down afterwards; other times they stay up in an unhealthy state for a while. Regions go offline and remain unavailable.
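
For illustration only: the trace above shows the OutOfMemoryError arriving wrapped in a java.io.IOException, so a handler that catches only OutOfMemoryError directly would miss it. The sketch below shows one way such wrapped errors might be detected so that a server could request its own shutdown instead of lingering. OomeGuard and all of its method names are hypothetical and are not HBase classes; this is not the fix that was applied on trunk.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper, not part of HBase: spots OOMEs hidden inside other exceptions. */
final class OomeGuard {
    // Set once an OOME has been seen; a server main loop could poll this flag.
    private final AtomicBoolean abortRequested = new AtomicBoolean(false);

    /** Returns true if t, or anything in its cause chain, looks like an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            // Assumption: some code paths keep only the OOME's text in the message,
            // as in "java.io.IOException: java.lang.OutOfMemoryError: Java heap space".
            String msg = t.getMessage();
            if (msg != null && msg.contains("java.lang.OutOfMemoryError")) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Records an abort request if t is OOME-related; returns whether it was. */
    boolean checkAndRequestAbort(Throwable t) {
        if (isOutOfMemory(t)) {
            abortRequested.set(true);
            return true;
        }
        return false;
    }

    boolean isAbortRequested() {
        return abortRequested.get();
    }
}
```

A caller would pass every caught IOException through checkAndRequestAbort and stop serving once isAbortRequested returns true. Matching on the message text is a heuristic and could give false positives.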

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.